Related
I need some help with grouping data by continuous values.
If I have this data.table
dt <- data.table::data.table( a = c(1,1,1,2,2,2,2,1,1,2), b = seq(1:10), c = seq(1:10)+1 )
a b c
1: 1 1 2
2: 1 2 3
3: 1 3 4
4: 2 4 5
5: 2 5 6
6: 2 6 7
7: 2 7 8
8: 1 8 9
9: 1 9 10
10: 2 10 11
I need a group for every following equal values in column a. Of this group i need the first (also min possible) value of column b and the last (also max possible) value of column c.
Like this:
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
Thank you very much for your help. I do not get it solved alone.
Probably we can try
> dt[, .(a = a[1], b = b[1], c = c[.N]), rleid(a)][, -1]
a b c
1: 1 1 4
2: 2 4 8
3: 1 8 10
4: 2 10 11
An option with dplyr
library(dplyr)
dt %>%
group_by(grp = cumsum(c(TRUE, diff(a) != 0))) %>%
summarise(across(a:b, first), c = last(c)) %>%
select(-grp)
-output
# A tibble: 4 × 3
a b c
<dbl> <int> <dbl>
1 1 1 4
2 2 4 8
3 1 8 10
4 2 10 11
I am trying to calculate the mean of time by keeping all the variables in the final dataset within dplyr package.
Here how my sample dataset looks like:
library(dplyr)
id <- c(1,1,1,1, 2,2,2,2, 3,3,3,3, 4,4,4,4)
gender <- c(1,1,1,1, 2,2,2,2, 2,2,2,2, 1,1,1,1)
item.id <-c(1,1,1,2, 1,1,2,2, 1,2,3,4, 1,2,2,3)
sequence<-c(1,2,3,1, 1,2,1,2, 1,1,1,1, 1,1,2,1)
time <- c(5,6,7,1, 2,3,4,9, 1,2,3,9, 5,6,7,8)
data <- data.frame(id, gender, item.id, sequence, time)
> data
id gender item.id sequence time
1 1 1 1 1 5
2 1 1 1 2 6
3 1 1 1 3 7
4 1 1 2 1 1
5 2 2 1 1 2
6 2 2 1 2 3
7 2 2 2 1 4
8 2 2 2 2 9
9 3 2 1 1 1
10 3 2 2 1 2
11 3 2 3 1 3
12 3 2 4 1 9
13 4 1 1 1 5
14 4 1 2 1 6
15 4 1 2 2 7
16 4 1 3 1 8
id for student id, gender for gender, item.id for the question ids students take, sequence is the sequence number of attempts to solve the question because students might return back to questions and try to answer again, and time is the time spent on each trial.
When calculating the mean of the time, I need to follow three steps:
(a) students have multiple trials for each question. I need to calculate the mean of the time for each item having multiple trials.
(b) then calculate the overall mean of the time for each id. For example, for id=1, I have two items, the first item has 3 trials and the second item has 1 trial. First I need to aggregate the time for the first item by (5+6+7)/3=6, so id=1 has item1 time 6 and item2 time 1. Second, taking 6 and 1 and calculating the mean for this student (6+1)/2=3.5.
(c) Lastly, I would like to keep all the variables in the dataset.
data <- data %>%
group_by(id) %>%
select(id, gender, item.id, sequence, time) %>%
summarize(mean.time = mean(time))
I got this but obviously this is only aggregating the mean by not taking into account of the within mean for each trial and this also does not keep all the variables:
> data
# A tibble: 4 x 2
id mean.time
<dbl> <dbl>
1 1 4.75
2 2 4.5
3 3 3.75
4 4 6.5
I thought select() was going to keep all variables.
The final dataset should look like this below:
> data
id gender item.id sequence time mean.time
1 1 1 1 1 5 3.5
2 1 1 1 2 6 3.5
3 1 1 1 3 7 3.5
4 1 1 2 1 1 3.5
5 2 2 1 1 2 4.5
6 2 2 1 2 3 4.5
7 2 2 2 1 4 4.5
8 2 2 2 2 5 4.5
9 3 2 1 1 1 3.75
10 3 2 2 1 2 3.75
11 3 2 3 1 3 3.75
12 3 2 4 1 9 3.75
13 4 1 1 1 5 6.5
14 4 1 2 1 6 6.5
15 4 1 2 2 7 6.5
16 4 1 3 1 8 6.5
I used dplyr but open any other solutions.
Thanks in advance!
We can use mutate instead of summarise as summarise returns a summarised output of 1 row per each group, while mutate creates a new column in the dataset
...
%>%
mutate(mean.time = mean(time))
If wee want to get the mean of mean, then first group by 'id', 'item.id', get the mean, and then grouped by 'id', get the mean of unique elements
data %>%
group_by(id, item.id) %>%
mutate(mean.time = mean(time)) %>%
group_by(id) %>%
mutate(mean.time = mean(unique(mean.time)))
# A tibble: 16 x 6
# Groups: id [4]
# id gender item.id sequence time mean.time
# <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 1 1 1 1 5 3.5
# 2 1 1 1 2 6 3.5
# 3 1 1 1 3 7 3.5
# 4 1 1 2 1 1 3.5
# 5 2 2 1 1 2 4.5
# 6 2 2 1 2 3 4.5
# 7 2 2 2 1 4 4.5
# 8 2 2 2 2 9 4.5
# 9 3 2 1 1 1 3.75
#10 3 2 2 1 2 3.75
#11 3 2 3 1 3 3.75
#12 3 2 4 1 9 3.75
#13 4 1 1 1 5 6.5
#14 4 1 2 1 6 6.5
#15 4 1 2 2 7 6.5
#16 4 1 3 1 8 6.5
Or instead of creating a second group by, we can do a match to get the first position of 'item.id', extract the 'mean.time' and get the mean
data %>%
group_by(id, item.id) %>%
mutate(mean.time = mean(time),
mean.time = mean(mean.time[match(unique(item.id), item.id)]))
Or use summarise and then do a left_join
data %>%
group_by(id, item.id) %>%
summarise(mean.time = mean(time)) %>%
group_by(id) %>%
summarise(mean.time = mean(mean.time)) %>%
right_join(data)
I have a problem with moving the rows to one upper row. When the rows become completely NA I would like to flush those rows (see the pic below). My current approach for this solution however still keeping the second rows.
Here is my approach
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
> data
gr A B C
1 1 1 NA 1
2 1 NA 1 NA
3 2 2 NA 4
4 2 NA 3 NA
5 3 4 NA 5
6 3 NA 7 NA
so using this approach
data.frame(apply(data,2,function(x){x[complete.cases(x)]}))
gr A B C
1 1 1 1 1
2 1 2 3 4
3 2 4 7 5
4 2 1 1 1
5 3 2 3 4
6 3 4 7 5
As we can see still I am having the second rows in each group!
The expected output
> data
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
thanks!
If there's at most one valid value per gr, you can use na.omit then take the first value from it:
data %>% group_by(gr) %>% summarise_all(~ na.omit(.)[1])
# [1] is optional depending on your actual data
# A tibble: 3 x 4
# gr A B C
# <int> <dbl> <dbl> <dbl>
#1 1 1 1 1
#2 2 2 3 4
#3 3 4 7 5
You can do it with dplyr like this:
data$ind <- rep(c(1,2), replace=TRUE)
data %>% fill(A,B,C) %>% filter(ind == 2) %>% mutate(ind=NULL)
gr A B C
1 1 1 1 1
2 2 2 3 4
3 3 4 7 5
Depending on how consistent your full data is, this may need to be adjusted.
One more solution using data.table:-
data <- data.frame(gr=c(rep(1:3,each=2)),A=c(1,NA,2,NA,4,NA), B=c(NA,1,NA,3,NA,7),C=c(1,NA,4,NA,5,NA))
library(data.table)
library(zoo)
setDT(data)
data[, A := na.locf(A), by = gr]
data[, B := na.locf(B), by = gr]
data[, C := na.locf(C), by = gr]
data <- unique(data)
data
gr A B C
1: 1 1 1 1
2: 2 2 3 4
3: 3 4 7 5
I have the following data structure:
Player Team Round Question Answer
1: 2 1 1 1 1
2: 5 1 1 1 1
3: 8 1 1 1 1
4: 9 1 1 1 1
5: 10 1 1 1 1
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
11: 2 1 1 4 4
12: 5 1 1 4 3
13: 8 1 1 4 4
14: 9 1 1 4 2
15: 10 1 1 4 4
16: ...
So there are several players from several team, answering several questions. There are always 2 rounds of games.
What I try to calculate is the medium and the agreement coefficient (see agrmt package) from the data by grouping the Team and the Question.
The result should look like this:
Team Question Median_R1 Agrmt_R1 Median_R2 Agrmt_R2
1: 1 1 1 1 1 1
2: 1 2 2 0.83 1 1
3: ...
4: 5 10 4 1 4 1
Does someone know if this is possible? I could not find a solution for this. I can solve the median and the agreement coefficient standalone, but not combined?
Every hint is welcome. Thank you very much.
UPDATE:
The agreement function returns a coefficient between -1 and 1. The values represent.
1 represents a full agreement (e.g. if every player answers 5).
0 would be, if every player has a different answer.
-1 would be, if a disagreement exists (some players say answer 1 and others say 5)
Compared to the median, the agreement functions takes a vector of the frequency vector.
For example, we have the following answers
Player Team Round Question Answer
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
The function inputs would look like this:
Median input: 4,5,5,5,5 --> Result: 5
Agreement input: 0,0,0,1,4 --> Result: 0.9
UPDATE 2: SOLVED
The calculation of the agreement could be done with the following code:
agreement(table(factor(x, levels=1:5)))
The final is based on #sandipan implementation. I had to add another sorting step in order to combine the right data.frames:
library(agrmt)
df1 <- unique(df[c('Party', 'Question')])
for (df.R in split(df, df$Round)) {
round <- unique(df.R$Round)
# get the data.frame of the current Round.
df2 <- as.data.frame(as.list(aggregate(Answer ~ Party + Question + Round,
df.R, FUN = function(x) c(Median = median(x), Agrmt = agreement(table(factor(x, levels=1:5)))))))
# sort it and take only the columns of median and agreement
df3 <- df2[with(df2, order(Party, Question)),][4:5]
names(df3) <- c(paste('Median_R', round, sep=''), paste('Agrmt_R', round, sep=''))
df1 <- cbind.data.frame(df1, df3)
}
df1
Thank you all for the help.
Here are three approaches: base R aggregate, dplyr, and data.table.
With base R aggregate:
library(agrmt)
aggregate(Answer ~ Team + Round + Question, data=dat,
FUN = function(x) {
c(Median=median(x),
Agreement=agreement(table(factor(x, levels=1:5))))
})
Team Round Question Answer.Median Answer.Agreement
1 1 1 1 1.0 1.0
2 1 1 2 5.0 0.9
3 1 1 4 4.0 0.7
With dplyr:
library(dplyr)
dat.summary = dat %>% group_by(Team, Round, Question) %>%
summarise(Median=median(Answer),
Agreement=agreement(table(factor(Answer, levels=1:5))))
Team Round Question Median Agreement
1 1 1 1 1 1.0
2 1 1 2 5 0.9
3 1 1 4 4 0.7
With data.table:
library(data.table)
dat.summary = setDT(dat)[, list(Median=median(Answer),
Agreement=agreement(table(factor(Answer, levels=1:5)))),
by=list(Team, Round, Question)]
Team Round Question Median Agreement
1: 1 1 1 1 1.0
2: 1 1 2 5 0.9
3: 1 1 4 4 0.7
To get a "wide" data frame as the final output:
In the examples above, I've left the output in "long" format. If you want to reshape to "wide" format, so that each Round get its own set of columns, you can do as follows:
First, let's add a second Round to the sample data by stacking another copy of the sample data:
library(dplyr)
library(reshape2)
library(agrmt)
dat = bind_rows(dat, dat %>% mutate(Round=2))
Now calculate the median and agreement with the same code we used before in the dplyr example:
dat.summary = dat %>%
group_by(Team, Round, Question) %>%
summarise(Median=median(Answer),
Agreement=agreement(table(factor(Answer, levels=1:5))))
Finally, reshape to wide format. This requires first "melting" the data to stack the Median and Agreement columns into a single column, and then casting to wide format. We also include the second line of code to add "Round" to each Round so that we get the column names we want in the wide data frame:
dat.summary = dat.summary %>%
mutate(Round = paste0("Round", Round)) %>%
melt(id.var=c("Team","Question","Round")) %>%
dcast(Team + Question ~ variable + Round, value.var="value")
Team Question Median_Round1 Median_Round2 Agreement_Round1 Agreement_Round2
1 1 1 1 1 1.0 1.0
2 1 2 5 5 0.9 0.9
3 1 4 4 4 0.7 0.7
I guess you want something as follows, right?
df
Player Team Round Question Answer
1: 2 1 1 1 1
2: 5 1 1 1 1
3: 8 1 1 1 1
4: 9 1 1 1 1
5: 10 1 1 1 1
6: 2 1 1 2 4
7: 5 1 1 2 5
8: 8 1 1 2 5
9: 9 1 1 2 5
10: 10 1 1 2 5
11: 2 1 1 4 4
12: 5 1 1 4 3
13: 8 1 1 4 4
14: 9 1 1 4 2
15: 10 1 1 4 4
16: 2 1 2 1 2
17: 5 1 2 1 3
18: 8 1 2 1 4
19: 2 1 2 2 5
20: 5 1 2 2 3
21: 8 1 2 2 1
22: 2 1 2 4 6
23: 5 1 2 4 1
24: 8 1 2 4 5
library(agrmt)
df1 <- unique(df[c('Team', 'Question')])
for (df.R in split(df, df$Round)) {
round <- unique(df.R$Round)
df2 <- as.data.frame(as.list(aggregate(Answer ~ Team + Question + Round,
df.R, FUN = function(x) c(Median = median(x), Agrmt = agreement(x)))))[4:5]
names(df2) <- c(paste('Median_R', round, sep=''), paste('Agrmt_R', round, sep=''))
df1 <- cbind.data.frame(df1, df2)
}
df1
Team Question Median_R1 Agrmt_R1 Median_R2 Agrmt_R2
1: 1 1 1 0.00000000 3 0.2222222
6: 1 2 5 0.04166667 3 0.4444444
11: 1 4 4 -0.05882353 5 -0.5833333
How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new group variable using cumsum. Whenever the difference between two numbers in group is not 0, R assigns a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(T, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table_1.9.5. The devel version introduced new functions rleid and shift (default type is "lag" and fill is "NA") that can be useful for this.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3