Lagged function within group - r

I would like to write code to compute within each group, sum of lagged differences as shown in the table below:
ID x rank U R Required Output Value
1 1 1 U1 R1 -
1 1 2 U2 R2 R2-U1
1 1 3 U3 R3 (R3-U2) + (R3-U1)
1 1 4 U4 R4 (R4-U3) + (R4-U2) + (R4-U1)
1 0 5 U5 R5 R5
1 0 6 U6 R6 R6
1 0 7 U7 R7 R7
2 1 1 U8 R8 -
2 1 2 U9 R9 R9-U8
2 1 3 U10 R10 (R10-U9) + (R10 - U8)
2 1 4 U11 R11 (R11-U10) + (R11 - U9) + (R11 - U8)
3 1 1 U12 R12 -
3 0 2 U13 R13 R13
3 0 3 U14 R14 R14
ID is the unique group identifier. x is a bool and depending on its value the required output is either sum of difference with previous values or same period value. "rank" is a rank ordering column and the maximum rank can vary within each group. "U" and "R" are the main columns of interest.
To give a numerical example, I need the following:
ID x rank U R Required Output Value
1 1 1 10 7 -
1 1 2 9 11 1
1 1 3 10 10 1 + 0 = 1
1 1 4 11 13 3+4+3 = 10
1 0 5 7 8 8
1 0 6 8 8 8
1 0 7 5 7 7
2 1 1 3 2 -
2 1 2 9 15 12
2 1 3 13 14 16
2 1 4 1 14 17
3 1 1 12 1 -
3 0 2 14 9 9
3 0 3 1 11 11
R code to generate this table:
ID = c(rep(1,7),rep(2,4),rep(3,3))
x = c(rep(1,4),rep(0,3),rep(1,5),rep(0,2))
rank = c(1:7,1:4,1:3)
U = c(10,9,10,11,7,8,5,3,9,13,1,12,14,1)
R = c(7,11,10,13,8,8,7,2,15,14,14,1,9,11)
dat = cbind(ID,x,rank,U,R)
colnames(dat)=c("ID","x","rank","U","R")

Here's a tidyverse solution:
library(dplyr)
library(tidyr)
dat %>%
as_tibble() %>%
group_by(ID) %>%
mutate(output = ifelse(x, lag(rank) * R - lag(cumsum(U)), R))
Result:
# A tibble: 14 x 6
# Groups: ID [3]
ID x rank U R output
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11

Here is a base R solution using ave
dat <- within(dat,output <- ave(R,ID,x, FUN = function(v) v*(seq(v)-1))-ave(U,ID,x, FUN = function(v) c(NA,cumsum(v)[-length(v)])))
dat <- within(dat, output <- ifelse(x==0,R,output))
such that
> dat
ID x rank U R output
1 1 1 1 10 7 NA
2 1 1 2 9 11 1
3 1 1 3 10 10 1
4 1 1 4 11 13 10
5 1 0 5 7 8 8
6 1 0 6 8 8 8
7 1 0 7 5 7 7
8 2 1 1 3 2 NA
9 2 1 2 9 15 12
10 2 1 3 13 14 16
11 2 1 4 1 14 17
12 3 1 1 12 1 NA
13 3 0 2 14 9 9
14 3 0 3 1 11 11

Related

Create column with ID starting at 1 and increment when value in another column changes in R

I have a data frame like so:
ID <- c('A','A','A','A','A','A','A','A','A','A','A','B','B','B','B','B')
val1 <- c(0,1,2,3,4,5,6,7,8,9,10,11,0,1,2,3)
val2 <- c(0,1,2,3,4,5,0,1,0,1,2,0,1,0,1,2)
df <- data.frame(ID, val1, val2)
Output:
ID val1 val2
1 A 0 0
2 A 1 1
3 A 2 2
4 A 3 3
5 A 4 4
6 A 5 5
7 A 6 0
8 A 7 1
9 A 8 0
10 A 9 1
11 A 10 2
12 B 11 0
13 B 0 1
14 B 1 0
15 B 2 1
16 B 3 2
I am trying to create a third column (val 3) which is like an index. When val1 = 0 and val 2 = 0 it should be 1 (this is also grouped by ID). It should stay as one and then increment by 1 until val2 = 0 again, like so showing desired output:
ID val1 val2 val3
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2
How can this be achieved? I tried:
df <- df %>%
group_by(ID, val2) %>%
mutate(val3 = row_number())
And:
df$val3 <- cumsum(c(1,diff(df$val2)==0))
But neither provide the desired outcome.
Inside cumsum use the logical comparison val2==0
df %>%
group_by(ID) %>%
mutate(val3 = cumsum(val2==0))
# A tibble: 16 × 4
# Groups: ID [2]
ID val1 val2 val3
<chr> <dbl> <dbl> <int>
1 A 0 0 1
2 A 1 1 1
3 A 2 2 1
4 A 3 3 1
5 A 4 4 1
6 A 5 5 1
7 A 6 0 2
8 A 7 1 2
9 A 8 0 3
10 A 9 1 3
11 A 10 2 3
12 B 11 0 1
13 B 0 1 1
14 B 1 0 2
15 B 2 1 2
16 B 3 2 2

Replace row value in a data frame group by the smallest value in that group

I have the following data set:
time <- c(0,1,2,3,4,5,0,1,2,3,4,5,0,1,2,3,4,5)
value <- c(10,8,6,5,3,2,12,10,6,5,4,2,20,15,16,9,2,2)
group <- c(1,1,1,1,1,1,2,2,2,2,2,2,3,3,3,3,3,3)
data <- data.frame(time, value, group)
I want to create a new column called data$diff that is equal to data$value minus the value of data$value when data$time == 0 within each group.
I am beginning with the following code
for(i in 1:nrow(data)){
for(n in 1:max(data$group)){
if(data$group[i] == n) {
data$diff[i] <- ???????
}
}
}
But cannot figure out what to put in place of the question marks. The desired output would be this table: https://i.stack.imgur.com/1bAKj.png
Any thoughts are appreciated.
Since in your example data$time == 0 is always the first element of the group, you can use this data.table approach.
library(data.table)
setDT(data)
data[, diff := value[1] - value, by = group]
In case that data$time == 0 is not the first element in each group you can use this:
data[, diff := value[time==0] - value, by = group]
Output:
> data
time value group diff
1: 0 10 1 0
2: 1 8 1 2
3: 2 6 1 4
4: 3 5 1 5
5: 4 3 1 7
6: 5 2 1 8
7: 0 12 2 0
8: 1 10 2 2
9: 2 6 2 6
10: 3 5 2 7
11: 4 4 2 8
12: 5 2 2 10
13: 0 20 3 0
14: 1 15 3 5
15: 2 16 3 4
16: 3 9 3 11
17: 4 2 3 18
18: 5 2 3 18
Here is a base R approach.
within(data, diff <- ave(
seq_along(value), group,
FUN = \(i) value[i][time[i] == 0] - value[i]
))
Output
time value group diff
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
Here is a short way to do it with dplyr.
library(dplyr)
data %>%
group_by(group) %>%
mutate(diff = value[which(time == 0)] - value)
Which gives
# Groups: group [3]
time value group diff
<dbl> <dbl> <dbl> <dbl>
1 0 10 1 0
2 1 8 1 2
3 2 6 1 4
4 3 5 1 5
5 4 3 1 7
6 5 2 1 8
7 0 12 2 0
8 1 10 2 2
9 2 6 2 6
10 3 5 2 7
11 4 4 2 8
12 5 2 2 10
13 0 20 3 0
14 1 15 3 5
15 2 16 3 4
16 3 9 3 11
17 4 2 3 18
18 5 2 3 18
library(dplyr)
vals2use <- data %>%
group_by(group) %>%
filter(time==0) %>%
select(c(2,3)) %>%
rename(value4diff=value)
dataNew <- merge(data, vals2use, all=T)
dataNew$diff <- dataNew$value4diff-dataNew$value
dataNew <- dataNew[,c(1,2,3,5)]
dataNew
group time value diff
1 1 0 10 0
2 1 1 8 2
3 1 2 6 4
4 1 3 5 5
5 1 4 3 7
6 1 5 2 8
7 2 0 12 0
8 2 1 10 2
9 2 2 6 6
10 2 3 5 7
11 2 4 4 8
12 2 5 2 10
13 3 0 20 0
14 3 1 15 5
15 3 2 16 4
16 3 3 9 11
17 3 4 2 18
18 3 5 2 18

Apply same function to several data replicates in R

Consider the following data simulation mechanism:
set.seed(1)
simulW <- function(G)
{
# Let G be the number of groups
n<-2*G #Assume 2 individuals per group
i<-rep(1:G, rep(2,G)) # Group index
j<-rep (1:n)
Y<-rbinom(n, 1, 0.5) # binary
data.frame(id=1:n, i,Y)
}
r<-5 #5 replicates
dat1 <- replicate(r, simulW(G = 10 ), simplify=FALSE)
#For example the first data replicate will be
> dat1[[1]]
id i Y
1 1 1 0
2 2 1 1
3 3 2 0
4 4 2 0
5 5 3 0
6 6 3 0
7 7 4 0
8 8 4 1
9 9 5 1
10 10 5 0
The code below can perform group wise (i is the group) sum of Y but by default considers only the first replicate i.e dat1[[1]].
Di<-aggregate( Y, by=list ( i ),FUN=sum) #Sum per group for the first dataset
e<-colSums(Di [ 2 ] ) #Total sum of Y for all groups for dataset 1
> e
x
8
di<-Di [ 2 ] # Groupwise sum for replicate 1
> di
x
1 2
2 2
3 2
4 0
5 2
How can I use the same function to perform the group wise sum for the other replicates.
Maybe something like:
for (m in 1:r )
{
Di[m]<-
e[m]<-
di[m]<-
}
You may use aggregate in lapply -
result <- lapply(dat1, function(x) aggregate(Y~i, x, sum))
result
#[[1]]
# i Y
#1 1 1
#2 2 1
#3 3 0
#4 4 0
#5 5 1
#6 6 1
#7 7 0
#8 8 2
#9 9 1
#10 10 1
#[[2]]
# i Y
#1 1 2
#2 2 2
#3 3 2
#4 4 0
#5 5 2
#6 6 1
#7 7 0
#8 8 0
#9 9 1
#10 10 1
#...
#...
We may use tidyverse
library(purrr)
library(dplyr)
map(dat1, ~ .x %>%
group_by(i) %>%
summarise(Y = sum(Y)))
-output
[[1]]
# A tibble: 10 × 2
i Y
<int> <int>
1 1 0
2 2 2
3 3 1
4 4 2
5 5 1
6 6 0
7 7 1
8 8 1
9 9 2
10 10 1
[[2]]
# A tibble: 10 × 2
i Y
<int> <int>
1 1 1
2 2 1
3 3 0
4 4 0
5 5 1
6 6 1
7 7 0
8 8 2
9 9 1
10 10 1
[[3]]
# A tibble: 10 × 2
i Y
<int> <int>
1 1 2
2 2 2
3 3 2
4 4 0
5 5 2
6 6 1
7 7 0
8 8 0
9 9 1
10 10 1
[[4]]
# A tibble: 10 × 2
i Y
<int> <int>
1 1 1
2 2 0
3 3 1
4 4 1
5 5 1
6 6 1
7 7 0
8 8 1
9 9 1
10 10 2
[[5]]
# A tibble: 10 × 2
i Y
<int> <int>
1 1 1
2 2 0
3 3 1
4 4 1
5 5 0
6 6 0
7 7 2
8 8 2
9 9 0
10 10 2

Sorting a specific range of column names in dplyr

I have a data frame and wish to sort specific columns alphabetically in dplyr. I know I can use the code below to sort all columns, but I would only like to sort columns C, B and A alphabetically. I tried using the across function as I would effectively like to select columns C:A, but this did not work.
df <- data.frame(1:16)
df$Testinfo1 <- 1
df$Band <- 1
df$Alpha <- 1
df$C <- c(10,12,14,16,10,12,14,16,10,12,14,16,10,12,14,16)
df$B <- c(10,0,0,0,12,12,12,12,0,14,NA_real_,14,16,16,16,16)
df$A <- c(1,1,1,1,1,1,1,1,1,1,1,14,NA_real_,NA_real_,NA_real_,16)
df
df %>%
select(sort(names(.)))
A Alpha B Band C Testinfo1 X1.16
1: 1 1 10 1 10 1 1
2: 1 1 0 1 12 1 2
3: 1 1 0 1 14 1 3
4: 1 1 0 1 16 1 4
5: 1 1 12 1 10 1 5
6: 1 1 12 1 12 1 6
7: 1 1 12 1 14 1 7
8: 1 1 12 1 16 1 8
9: 1 1 0 1 10 1 9
10: 1 1 14 1 12 1 10
11: 1 1 NA 1 14 1 11
12: 14 1 14 1 16 1 12
13: NA 1 16 1 10 1 13
14: NA 1 16 1 12 1 14
15: NA 1 16 1 14 1 15
16: 16 1 16 1 16 1 16
My desired output is below:
X1.16 Testinfo1 Band Alpha A B C
1: 1 1 1 1 1 10 10
2: 2 1 1 1 1 0 12
3: 3 1 1 1 1 0 14
4: 4 1 1 1 1 0 16
5: 5 1 1 1 1 12 10
6: 6 1 1 1 1 12 12
7: 7 1 1 1 1 12 14
8: 8 1 1 1 1 12 16
9: 9 1 1 1 1 0 10
10: 10 1 1 1 1 14 12
11: 11 1 1 1 1 NA 14
12: 12 1 1 1 14 14 16
13: 13 1 1 1 NA 16 10
14: 14 1 1 1 NA 16 12
15: 15 1 1 1 NA 16 14
16: 16 1 1 1 16 16 16
You can use relocate() (from dplyr 1.0.0 onwards):
library(dplyr)
vars <- c("C", "B", "A")
df %>%
relocate(all_of(sort(vars)), .after = last_col())
If you are passing a character vector of names you should wrap it in all_of() (which will error if any variables are missing) or any_of() which won't.
You can do
sortcols <- c("A","B","C")
library(dplyr)
df %>%
select(-sortcols, sort(sortcols))
The -sortcols part selects everything but the columns you want to sort and then you put the columns you want after those.
A base R option for a case which may or may not exist. If the columns that you want to sort are not at the end of the dataframe.
We add a new column D which you don't want to change the position of.
df$D <- 1:16
cols_to_sort <- c('A', 'B', 'C')
inds <- match(cols_to_sort, names(df))
cols <- seq_along(df)
cols[cols %in% inds] <- inds
df[cols]
# X1.16 Testinfo1 Band Alpha A B C D
#1 1 1 1 1 1 10 10 1
#2 2 1 1 1 1 0 12 2
#3 3 1 1 1 1 0 14 3
#4 4 1 1 1 1 0 16 4
#5 5 1 1 1 1 12 10 5
#6 6 1 1 1 1 12 12 6
#7 7 1 1 1 1 12 14 7
#8 8 1 1 1 1 12 16 8
#9 9 1 1 1 1 0 10 9
#10 10 1 1 1 1 14 12 10
#11 11 1 1 1 1 NA 14 11
#12 12 1 1 1 14 14 16 12
#13 13 1 1 1 NA 16 10 13
#14 14 1 1 1 NA 16 12 14
#15 15 1 1 1 NA 16 14 15
#16 16 1 1 1 16 16 16 16

How to calculate recency in R

I have the following data:
set.seed(20)
round<-rep(1:10,2)
part<-rep(1:2, c(10,10))
game<-rep(rep(1:2,c(5,5)),2)
pay1<-sample(1:10,20,replace=TRUE)
pay2<-sample(1:10,20,replace=TRUE)
pay3<-sample(1:10,20,replace=TRUE)
decs<-sample(1:3,20,replace=TRUE)
previous_max<-c(0,1,0,0,0,0,0,1,0,0,0,0,1,1,1,0,0,1,1,0)
gamematrix<-cbind(part,game,round,pay1,pay2,pay3,decs,previous_max )
gamematrix<-data.frame(gamematrix)
Here is the output:
part game round pay1 pay2 pay3 decs previous_max
1 1 1 1 9 5 6 2 0
2 1 1 2 8 1 1 1 1
3 1 1 3 3 5 5 3 0
4 1 1 4 6 1 5 1 0
5 1 1 5 10 3 8 3 0
6 1 2 6 10 1 5 1 0
7 1 2 7 1 10 7 3 0
8 1 2 8 1 10 8 2 1
9 1 2 9 4 1 5 1 0
10 1 2 10 4 7 7 2 0
11 2 1 1 8 4 1 1 0
12 2 1 2 8 5 5 2 0
13 2 1 3 1 9 3 1 1
14 2 1 4 8 2 10 2 1
15 2 1 5 2 6 2 3 1
16 2 2 6 5 5 6 2 0
17 2 2 7 4 5 1 2 0
18 2 2 8 2 10 5 2 1
19 2 2 9 3 7 3 2 1
20 2 2 10 9 3 1 1 0
How can I calculate a new indicator variable "previous_max",which returns whether in the next round of the same game, the same participant choose the maximal payoff from the previous round.
So I want something like follows:
Participant (part) 1:
In the first round of each game, previous_max is "0" (no previous round), in round 2, previous_max ="1", because in round 1, the maximal pay was max(pay1,pay2,pay3)=max(9,5,6)=9, and in round 2, the participant's decisions (decs) was 1 (which was the maximal value in previous round).
In round 3, previous_max=0, because the maximal value in round 2 was 8 (which is "pay1"), but the participant choose "3" (which is pay3).
Here's a solution using dplyr and purr::map.
I would have preferred to use group_by than split but max.col ignores groups and I don't know of a dplyr equivalent`.
the output is slightly different but I think it's because of your mistakes, please explain if not and I'll update my answer.
library(purrr)
library(dplyr)
gamematrix %>%
split(.$part) %>%
map(~ .x %>% mutate(
prev_max = as.integer(
decs ==
c(0,max.col(.[c("pay1","pay2","pay3")])[-n()]) # the number of the max columns, offset by one
))) %>%
bind_rows
# ` part game round pay1 pay2 pay3 decs prev_max
# 1 1 1 1 9 5 6 2 0
# 2 1 1 2 8 1 1 1 1
# 3 1 1 3 3 5 5 3 0
# 4 1 1 4 6 1 5 1 0
# 5 1 1 5 10 3 8 3 0
# 6 1 2 6 10 1 5 1 1
# 7 1 2 7 1 10 7 3 0
# 8 1 2 8 1 10 8 2 1
# 9 1 2 9 4 1 5 1 0
# 10 1 2 10 4 7 7 2 0
# 11 2 1 1 8 4 1 1 0
# 12 2 1 2 8 5 5 2 0
# 13 2 1 3 1 9 3 1 1
# 14 2 1 4 8 2 10 2 1
# 15 2 1 5 2 6 2 3 1
# 16 2 2 6 5 5 6 2 1
# 17 2 2 7 4 5 1 2 0
# 18 2 2 8 2 10 5 2 1
# 19 2 2 9 3 7 3 2 1
# 20 2 2 10 9 3 1 1 0

Resources