Extend numerical series in data frame - r

Data
Let's take a look at a simple dataset (mine is actually >200,000 rows):
df <- data.frame(
id = c(rep(1, 11), rep(2,6)),
ref.pos = c(NA,NA,NA,301,302,303,800,801,NA,NA,NA, 500,501,502, NA, NA, NA),
pos = c(1:11, 30:35)
)
Which thus looks like this:
id ref.pos pos
1 1 NA 1
2 1 NA 2
3 1 NA 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 NA 9
10 1 NA 10
11 1 NA 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 NA 33
16 2 NA 34
17 2 NA 35
What I want to achieve
Per id, I want to extend the numbers in ref.pos to fill the whole column, so that the ref.pos values decrease moving up the data frame and increase moving down the column. This would result in the following data frame:
id ref.pos pos
1 1 298 1
2 1 299 2
3 1 300 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 802 9
10 1 803 10
11 1 804 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 503 33
16 2 504 34
17 2 505 35
What I tried
I wish I could provide some code here; however, I haven't figured out a proper way in two days, especially not one applicable to large datasets. I found df %>% group_by(id) %>% tidyr::fill(ref.pos, .direction = "downup") interesting; however, this repeats numbers rather than counting down and up.
I hope my question is clear, otherwise let me know in the comments!

An option using data.table:
library(data.table)
# carry the last value forward, then the next value backward, so no NA remains
fillends <- function(x) nafill(nafill(x, "locf"), "nocb")
setDT(df)[, ref.pos2 := {
dif <- fillends(c(diff(ref.pos), NA_integer_))
frp <- fillends(ref.pos)
fp <- fillends(replace(pos, is.na(ref.pos), NA_integer_))
fifelse(is.na(ref.pos), frp + dif*(pos - fp), ref.pos)
}, id]
output:
id ref.pos pos ref.pos2
1: 1 NA 1 298
2: 1 NA 2 299
3: 1 NA 3 300
4: 1 301 4 301
5: 1 302 5 302
6: 1 303 6 303
7: 1 802 7 802
8: 1 801 8 801
9: 1 NA 9 800
10: 1 NA 10 799
11: 1 NA 11 798
12: 2 500 30 500
13: 2 501 31 501
14: 2 502 32 502
15: 2 NA 33 503
16: 2 NA 34 504
17: 2 NA 35 505
data:
df <- data.frame(
id = c(rep(1, 11), rep(2,6)),
ref.pos = c(NA,NA,NA,301,302,303,802,801,NA,NA,NA, 500,501,502, NA, NA, NA),
pos = c(1:11, 30:35)
)
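To see what the data.table call computes, the same arithmetic can be traced in base R for id 1 of the data above. This is only a sketch: locf(), nocb() and fillends_base() are hypothetical helpers written here for illustration, assumed to mimic nafill(x, "locf") and nafill(x, "nocb") on a plain vector.

```r
# Hypothetical base R stand-ins for nafill(x, "locf") / nafill(x, "nocb")
locf <- function(x) {
  seen <- cumsum(!is.na(x))          # how many non-NA values seen so far
  c(NA, x[!is.na(x)])[seen + 1]      # 0 seen -> NA, otherwise last seen value
}
nocb <- function(x) rev(locf(rev(x)))
fillends_base <- function(x) nocb(locf(x))

ref.pos <- c(NA, NA, NA, 301, 302, 303, 802, 801, NA, NA, NA)
pos <- 1:11

dif <- fillends_base(c(diff(ref.pos), NA))             # step size nearest to each row
frp <- fillends_base(ref.pos)                          # nearest known ref.pos
fp  <- fillends_base(replace(pos, is.na(ref.pos), NA)) # pos of that known value

# extrapolate: known value + step * distance from the known row
out <- ifelse(is.na(ref.pos), frp + dif * (pos - fp), ref.pos)
out
```

For row 9, for example, the nearest known value is 801 at pos 8 and the nearest step is -1, so the result is 801 + (-1) * (9 - 8) = 800, which matches the ref.pos2 column in the output above.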

A base R option is to define a custom function fill, which is applied per id with ave:
fill <- function(v) {
inds <- range(which(!is.na(v)))
l <- 1:inds[1]
u <- inds[2]:length(v)
v[l] <- v[inds[1]] - rev(l)+1
v[u] <- v[inds[2]] + seq_along(u)-1
v
}
df <- within(df,ref.pos <- ave(ref.pos,id,FUN = fill))
such that
> df
id ref.pos pos
1 1 298 1
2 1 299 2
3 1 300 3
4 1 301 4
5 1 302 5
6 1 303 6
7 1 800 7
8 1 801 8
9 1 802 9
10 1 803 10
11 1 804 11
12 2 500 30
13 2 501 31
14 2 502 32
15 2 503 33
16 2 504 34
17 2 505 35
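The fill function above assumes that every id has at least one non-NA value; with an all-NA group, range(which(...)) has nothing to work with and the call would fail. A slightly more defensive variant is sketched below under the hypothetical name fill_ends (not part of the original answer). Like fill, it only extends the two ends of each group and leaves interior NAs untouched.

```r
# Hypothetical variant of fill(): extend only the leading and trailing NA runs
fill_ends <- function(v) {
  known <- which(!is.na(v))
  if (length(known) == 0L) return(v)      # nothing to extrapolate from
  first <- known[1]
  last  <- known[length(known)]
  lead  <- seq_len(first - 1L)            # positions before the first known value
  trail <- seq_len(length(v) - last)      # offsets after the last known value
  v[lead] <- v[first] - rev(lead)         # count down towards the top
  v[last + trail] <- v[last] + trail      # count up towards the bottom
  v
}

df <- data.frame(
  id = c(rep(1, 11), rep(2, 6)),
  ref.pos = c(NA, NA, NA, 301, 302, 303, 800, 801, NA, NA, NA,
              500, 501, 502, NA, NA, NA),
  pos = c(1:11, 30:35)
)
df$ref.pos <- ave(df$ref.pos, df$id, FUN = fill_ends)
```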

Related

Colsums using loop from indices in list in r

Hello, I have a data frame with multiple columns, all containing numeric values. My data frame has over 200 columns, but the sample should do. I would like to take the column indices from each element of a list and use them in a rowSums loop, so that each list name becomes a new column holding the row sums over that combination of indices.
Main <- c(rep(1, times = 6), rep(2, times = 6))
Feature1 <- sample(1:20, 12, replace = T)
Feature2 <- sample(400:500, 12, replace = T)
Feature3 <- sample(1:5, 12, replace = T)
df.main <- data.frame(Main, Feature1, Feature2,
Feature3, stringsAsFactors = FALSE)
Main Feature1 Feature2 Feature3
1 1 6 483 3
2 1 9 405 1
3 1 18 494 5
4 1 7 499 5
5 1 13 436 1
6 1 2 451 3
7 2 4 456 3
8 2 19 442 5
9 2 16 437 2
10 2 4 497 4
11 2 7 497 3
12 2 5 466 1
vlist.imps <- list(`Cool Ranch|Cool Chipotle` = c(1L, 4L), `Trust|Scotia` = c(3L,
4L))
I want my output to look like this
Main Feature1 Feature2 Feature3 cool_ranch trust_scotia
1 1 6 483 3 4 486
2 1 9 405 1 2 406
3 1 18 494 5 6 499
4 1 7 499 5 6 504
5 1 13 436 1 2 437
6 1 2 451 3 4 454
7 2 4 456 3 5 459
8 2 19 442 5 7 447
9 2 16 437 2 4 439
10 2 4 497 4 6 501
11 2 7 497 3 5 500
12 2 5 466 1 3 467
I have tried a few things along the same lines as below
> sum.test<- apply(df.main, 2, function(i) rowSums[vlist.imps$i])
Error in rowSums[vlist.imps$i] :
object of type 'closure' is not subsettable
We can loop over 'vlist.imps', extract the columns of 'df.main' at those indices, compute the rowSums, and assign the output back to create new columns:
df.main[names(vlist.imps)] <- lapply(vlist.imps, function(x) rowSums(df.main[x]))
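As a self-contained check, here is a minimal sketch of the same idea on the first two rows and row 7 of the sample data. The short column names cool_ranch and trust_scotia are taken from the desired output in the question; assigning them by hand is an assumption made here for illustration.

```r
df.main <- data.frame(
  Main     = c(1, 1, 2),
  Feature1 = c(6, 9, 4),
  Feature2 = c(483, 405, 456),
  Feature3 = c(3, 1, 3)
)
vlist.imps <- list(`Cool Ranch|Cool Chipotle` = c(1L, 4L),
                   `Trust|Scotia` = c(3L, 4L))

# compute all sums first, so the indices still refer to the original columns
sums <- lapply(vlist.imps, function(idx) rowSums(df.main[idx]))
names(sums) <- c("cool_ranch", "trust_scotia")   # names as in the desired output
df.main[names(sums)] <- sums
df.main
```

Computing every sum before adding any column keeps the indices in vlist.imps pointing at the original columns.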

Taking values from a dataframe to loop for operations into another dataframe

vocab
wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt
wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912
For each wordIDx in vocab, I need to compute the following formula.
For instance wordIDx = 1 ;
my value should be
max(log(0.02813027)+sum(log(2/47821),log(2/47821)))
= -23.73506
I have the following code for now:
classifier_3$ans<- max(log(classifier_3$`P(W_j)`)+ (sum(log(classifier_3$`wordID/newsgroup`/classifier_3$`totalwords/newsgroup`))))
How can I loop so that it considers every wordIDx from the vocab data frame and computes the value shown in the example above?
Something like this, but you really need to clean your column names.
vocab <- read.table(text = "wordIDx V1
1 archive
2 name
3 atheism
4 resources
5 alt", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- read.table(text = "wordIDx newsgroup_ID docIdx word/doc totalwords/doc totalwords/newsgroup wordID/newsgroup P(W_j)
1 1 196 3 1240 47821 2 0.028130269
1 1 47 2 1220 47821 2 0.028130269
2 12 4437 1 702 47490 8 0.8
3 12 4434 1 673 47490 8 0.035051912
5 12 4398 1 53 47490 8 0.4
3 12 4564 11 1539 47490 8 0.035051912", header = TRUE, stringsAsFactors = FALSE)
classifier_3 <- classifier_3[!duplicated(classifier_3$wordIDx), ]
classifier_3 <- merge(vocab, classifier_3, by = c("wordIDx"))
classifier_3$ans<- pmax(log(classifier_3$`P.W_j.`)+
(log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`) +
# isn't that times 2?
log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`)),
log(classifier_3$`wordID.newsgroup`/classifier_3$`totalwords.newsgroup`))
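Another way to read the question's example is as a per-word calculation: take the log of P(W_j) once and add the sum of log(wordID/newsgroup divided by totalwords/newsgroup) over all rows of that wordIDx. The sketch below follows that reading. The shortened column names (p_w, word_newsgroup, total_newsgroup) are assumptions introduced here, and P(W_j) is assumed to be constant within a wordIDx.

```r
classifier_3 <- data.frame(
  wordIDx         = c(1, 1, 2, 3, 5, 3),
  total_newsgroup = c(47821, 47821, 47490, 47490, 47490, 47490),
  word_newsgroup  = c(2, 2, 8, 8, 8, 8),
  p_w             = c(0.028130269, 0.028130269, 0.8, 0.035051912, 0.4, 0.035051912)
)

# one score per wordIDx: log prior + sum of log ratios over that word's rows
score <- sapply(split(classifier_3, classifier_3$wordIDx), function(d) {
  log(d$p_w[1]) + sum(log(d$word_newsgroup / d$total_newsgroup))
})
score[["1"]]   # compare with the value -23.73506 given in the question
```

The result is a named vector indexed by wordIDx, which can be merged back onto vocab if needed.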

Merging Data frames and creating columns based on conditions

I have 2 data frames
Data Frame A:
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
Data Frame B:
TimeStart TimeEnd Alarm
2 5 556
7 9 556
I would like to create the following joined dataframe:
Time Reading Alarmtime Alarm alarmno
1 20 n/a n/a n/a
2 23 2 556 1
3 25 556 1
4 22 556 1
5 24 5 556 1
6 23 n/a n/a n/a
7 24 7 556 2
8 23 556 2
9 23 9 556 2
10 22 n/a n/a n/a
I can do the join easily enough; however, I'm struggling to fill the rows that follow with the alarm until the time the alarm ended. I also want to number each individual alarm, so that even if they are the same alarm they are counted separately. Any thoughts on how I can do this would be great.
Thanks
library(sqldf)
df_b$AlarmNo <- seq_len(nrow(df_b))
sqldf('
select a.Time
, a.Reading
, case when a.Time in (b.TimeStart, b.TimeEnd)
then a.Time
else NULL
end as AlarmTime
, b.Alarm
, b.AlarmNo
from df_a a
left join df_b b
on a.Time between b.TimeStart and b.TimeEnd
')
# Time Reading AlarmTime Alarm AlarmNo
# 1 1 20 NA NA NA
# 2 2 23 2 556 1
# 3 3 25 NA 556 1
# 4 4 22 NA 556 1
# 5 5 24 5 556 1
# 6 6 23 NA NA NA
# 7 7 24 7 556 2
# 8 8 23 NA 556 2
# 9 9 23 9 556 2
# 10 10 22 NA NA NA
Or
library(data.table)
setDT(df_b)
df_c <-
df_b[, .(Time = seq(TimeStart, TimeEnd), Alarm, AlarmNo = .GRP)
, by = TimeStart]
merge(df_a, df_c, by = 'Time', all.x = TRUE)
# Time Reading TimeStart Alarm AlarmNo
# 1: 1 20 NA NA NA
# 2: 2 23 2 556 1
# 3: 3 25 2 556 1
# 4: 4 22 2 556 1
# 5: 5 24 2 556 1
# 6: 6 23 NA NA NA
# 7: 7 24 7 556 2
# 8: 8 23 7 556 2
# 9: 9 23 7 556 2
# 10: 10 22 NA NA NA
Data used:
df_a <- fread('
Time Reading
1 20
2 23
3 25
4 22
5 24
6 23
7 24
8 23
9 23
10 22
')
df_b <- fread('
TimeStart TimeEnd Alarm
2 5 556
7 9 556
')
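For comparison, the interval lookup can also be sketched in base R without extra packages. This is an illustrative version only: it assumes the alarm intervals do not overlap, and it scans all intervals for every row, so it is not meant for large data.

```r
df_a <- data.frame(Time = 1:10,
                   Reading = c(20, 23, 25, 22, 24, 23, 24, 23, 23, 22))
df_b <- data.frame(TimeStart = c(2, 7), TimeEnd = c(5, 9), Alarm = c(556, 556))

# row of df_b whose interval contains each Time, or NA if there is none
hit <- sapply(df_a$Time, function(t) {
  i <- which(t >= df_b$TimeStart & t <= df_b$TimeEnd)
  if (length(i)) i[1] else NA_integer_
})

out <- df_a
out$Alarm   <- df_b$Alarm[hit]
out$AlarmNo <- hit                       # each row of df_b is a separate alarm
# keep the time only on the first and last row of each alarm
on_edge <- !is.na(hit) &
  (out$Time == df_b$TimeStart[hit] | out$Time == df_b$TimeEnd[hit])
out$AlarmTime <- ifelse(on_edge, out$Time, NA)
out
```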

Efficient way in R to add a new column to a dataframe with huge dataset

I really need to speed some R code up. I have a large dataset from a particular sport. Each row in the data frame represents some type of action in the game. For each game (game_id) we have two teams (team_id) that take part in the game. time_ref gives the chronological order of the actions within each game. action_id is the type of action in the game. player_off is set to TRUE or FALSE and is linked to action_id=3: action_id=3 represents a player getting a card, and player_off records whether the player was sent off when they got that card. Example data.frame:
> df
game_id team_id action_id player_off time_ref
100 10 1 NA 1000
100 10 1 NA 1001
100 10 1 NA 1002
100 11 1 NA 1003
100 11 2 NA 1004
100 11 1 NA 1005
100 10 3 1 1006
100 11 1 NA 1007
100 10 1 NA 1008
100 10 1 NA 1009
101 12 3 0 1000
101 12 1 NA 1001
101 12 1 NA 1002
101 13 2 NA 1003
101 13 3 1 1004
101 12 1 NA 1005
101 13 1 NA 1006
101 13 1 NA 1007
101 12 1 NA 1008
101 12 1 NA 1009
What I need is another column in the data frame that gives me TRUE or FALSE on whether both teams had an equal/unequal number of players on the field while each action (row) took place.
So game_id=100 had an action_id=3 & player_off=1 for team_id=10 at time_ref=1006. So we know the teams were equal with number of players on the field up to that point but unequal for the rest of the game (time_ref>1006). The same thing occurred in game_id=101 also.
This is an example of the data frame with the extra column I would like to have for the dataset.
>df
game_id team_id action_id player_off time_ref is_even
100 10 1 NA 1000 1
100 10 1 NA 1001 1
100 10 1 NA 1002 1
100 11 1 NA 1003 1
100 11 2 NA 1004 1
100 11 1 NA 1005 1
100 10 3 1 1006 1
100 11 1 NA 1007 0
100 10 1 NA 1008 0
100 10 1 NA 1009 0
101 12 3 0 1000 1
101 12 1 NA 1001 1
101 12 1 NA 1002 1
101 13 2 NA 1003 1
101 13 3 1 1004 1
101 12 1 NA 1005 0
101 13 1 NA 1006 0
101 13 1 NA 1007 0
101 12 1 NA 1008 0
101 12 1 NA 1009 0
So you can see that in game_id=100 a player was sent off at time_ref=1006 so all previous rows were marked as is_even=1 and subsequent marked as uneven or 0. Similar for game_id=101 at time_ref=1004.
What is the most efficient way of achieving this extra column? Preferably not using for loops.
For some vector
x = c(0, NA, NA, NA, 1, NA, NA, NA)
write a function to standardize the data (0 or 1 player lost), calculate the cumulative number of players lost, and compare this to zero,
fun0 = function(x) {
x[is.na(x)] = 0
cumsum(x) == 0
}
For several groups, use ave() with a grouping variable
x = c(x, rev(x))
grp = rep(1:2, each = length(x) / 2)
ave(x, grp, FUN = fun0)
For the data in the question, try
df$is_even = ave(df$player_off, df$game_id, FUN = fun0)
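Note that fun0() marks the row on which the player is sent off as uneven, whereas the expected output in the question still shows is_even = 1 on that row. If the flag should only change from the next action onwards, one possible variant counts only the players lost before the current row. The name fun0_lagged is hypothetical and not part of the original answer.

```r
# Hypothetical variant of fun0(): compare the count of players lost
# *before* each row with zero
fun0_lagged <- function(x) {
  x[is.na(x)] <- 0
  (cumsum(x) - x) == 0        # cumsum(x) - x excludes the current row
}

x <- c(0, NA, NA, NA, 1, NA, NA, NA)
res_single <- fun0_lagged(x)

# by group, as with fun0(); ave() returns the flags as 1/0 here
grp <- rep(1:2, each = 8)
res_grouped <- ave(c(x, rev(x)), grp, FUN = fun0_lagged)
```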
Semantically, it seems likely that fun0() is more complicated than implied in this solution, specifically that if each team loses a player, they are again even, as @SunLisa says. If so, clean the data
df$player_off[is.na(df$player_off)] = 0
and change fun0(), e.g.,
fun1 <- function(x, team) {
is_team_1 <- team == head(team, 1) # is 'team' the first team?
x1 <- x & is_team_1 # lost player & team 1
x2 <- x & !is_team_1 # lost player & team 2
cumsum(x1) == cumsum(x2) # same total number of players?
}
(it doesn't seem like a good idea to coerce the logical return value to an integer). This could be applied by group with
df$is_even = ave(seq_len(nrow(df)), df$game_id, FUN = function(i) {
fun1(df$player_off[i], df$team_id[i])
})
or
split(df$is_even, df$game_id) <-
Map(fun1,
split(df$player_off, df$game_id),
split(df$team_id, df$game_id)
)
The implementation of ave() is useful to look at, the important line being
split(x, g) <- lapply(split(x, g), FUN)
The right-hand side splits x by group g, then applies FUN() to each group. The left-hand side split<-() is a tricky operation, using the group indexes to update the original vector x.
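A small toy example (the vectors here are made up for illustration) shows the split<-() assignment writing each group's result back into the original positions:

```r
x <- c(1, 2, 3, 4, 5, 6)
g <- c("a", "b", "a", "b", "a", "b")
expected <- ave(x, g, FUN = cumsum)        # what ave() returns for the same task

# apply cumsum() to each group, then write the pieces back in place
split(x, g) <- lapply(split(x, g), cumsum)
x
```

Group "a" occupies positions 1, 3 and 5, so its cumulative sums 1, 4, 9 land there, and group "b" fills positions 2, 4 and 6 with 2, 6, 12.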
Comments
The original question asked for 'no for loops', but actually lapply() (in ave()) and Map() are exactly that; ave() is relatively efficient because of the split-apply-combine strategy it adopts, rather than what the OP probably implemented, which was likely to iterate through games, subset the data frame, then update the data.frame for each game. The subsetting would have duplicated subsets of the entire data set, and the update in particular would have copied at least the entire result column on each assignment; this copying would have slowed the execution down a lot. It's also possible that the OP was struggling with fun0(); it would help to clarify the question, especially the title, to identify that as the problem.
There are faster ways, especially using the data.table package, but the principle is the same -- identify a function that operates on a vector the way you'd like, and apply it by group.
An alternative, fully-vectorized, solution follows this suggestion to calculate a cumulative sum by group. For fun0(), standardize x to be the number of players leaving the game at a particular timepoint, without NAs
x[is.na(x)] = 0
For the equivalent of fun0(), calculate the cumulative sum of players leaving the game, irrespective of group
cs = cumsum(x)
Correct this for the group that the cumulative sum applies to
in_game = cs - (grp - 1)
and set this to 'TRUE' when 0 players have left the game
is_even = (in_game == 0)
This relies on grp indexing from 1 to the number of groups; for the data here one might use grp = match(df$game_id, unique(df$game_id)). A similar solution exists for fun1().
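The correction cs - (grp - 1) works when every earlier group lost exactly one player, as in the example vector. A possible generalisation, sketched below on made-up data and assuming that rows of the same group are contiguous, subtracts the running total reached just before each group's first row.

```r
x   <- c(0, 1, NA, 1, NA, NA, 1, NA, NA, NA)   # group 1 loses two players
grp <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2)

x[is.na(x)] <- 0
cs <- cumsum(x)                         # running total, irrespective of group

starts  <- which(!duplicated(grp))      # first row of each group
offset  <- (cs - x)[starts]             # total lost before each group starts
grp_idx <- match(grp, unique(grp))      # group index from 1 to number of groups

cs_by_group <- cs - offset[grp_idx]     # cumulative sum restarted per group
is_even <- cs_by_group == 0
```

The intermediate cs_by_group should agree with ave(x, grp, FUN = cumsum), which gives a convenient way to check the vectorised version on a sample of the data.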
Here's a dplyr + tidyr solution to the problem, with the summary of what was done:
Manipulate the data by converting all NAs in player_off to 0 for easier summing and assigning the smaller team_id (assuming there are only 2) to team1 and the other to team2
"Tally" the player_offs using spread and fill the invalid combinations in the data with 0 -- for example, in game_id = 100, there's no team_id = 11 for time_ref = 1000
Take the cumulative sum of the lagged team1 and team2 vectors (and of course fill NAs with 0)
Code below:
require(dplyr)
require(tidyr)
df %>%
group_by(game_id) %>%
mutate(
player_off = player_off %>% replace(list = is.na(.), values = 0),
team_num = if_else(team_id == min(team_id), "team1", "team2")
) %>%
spread(key = team_num, value = player_off, fill = 0) %>%
arrange(game_id, time_ref) %>%
mutate(
team1_cum = cumsum(lag(team1, default = 0)),
team2_cum = cumsum(lag(team2, default = 0)),
is_even = as.integer(team1_cum == team2_cum)
) %>%
ungroup() %>%
select(-team1, -team2, -team1_cum, -team2_cum)
Output:
# A tibble: 20 x 5
game_id team_id action_id time_ref is_even
<int> <int> <int> <int> <int>
1 100 10 1 1000 1
2 100 10 1 1001 1
3 100 10 1 1002 1
4 100 11 1 1003 1
5 100 11 2 1004 1
6 100 11 1 1005 1
7 100 10 3 1006 1
8 100 11 1 1007 0
9 100 10 1 1008 0
10 100 10 1 1009 0
11 101 12 3 1000 1
12 101 12 1 1001 1
13 101 12 1 1002 1
14 101 13 2 1003 1
15 101 13 3 1004 1
16 101 12 1 1005 0
17 101 13 1 1006 0
18 101 13 1 1007 0
19 101 12 1 1008 0
20 101 12 1 1009 0
Here's my thinking:
data.table is going to work well, especially when you are working with large data sets. It's faster. We just need to group the data, take the cumulative sum of each team's send-offs, and see if they are equal.
First I have to say:
(problem solved by Martin Morgan, his updated answer no longer has this error)
I don't think @Martin Morgan's answer is right. Let's imagine a certain case:
when team 1 has one player sent off, after which team 2 has a player sent off, the two teams should be even again, but @Martin Morgan's output would be FALSE.
I'll make an example with this dataset, where player_off of record 19 was changed to 1. This means that in game 101, after team 13 had 1 player off at 1004, team 12 had 1 player off at 1008, which makes the two teams even at 1009.
> dt.1
game_id team_id action_id player_off time_ref
1 100 10 1 NA 1000
2 100 10 1 NA 1001
3 100 10 1 NA 1002
4 100 11 1 NA 1003
5 100 11 2 NA 1004
6 100 11 1 NA 1005
7 100 10 3 1 1006
8 100 11 1 NA 1007
9 100 10 1 NA 1008
10 100 10 1 NA 1009
11 101 12 3 0 1000
12 101 12 1 NA 1001
13 101 12 1 NA 1002
14 101 13 2 NA 1003
15 101 13 3 1 1004
16 101 12 1 NA 1005
17 101 13 1 NA 1006
18 101 13 1 NA 1007
19 101 12 1 1 1008
20 101 12 1 NA 1009
But @Martin Morgan's function would produce this output:
> dt.1$is_even = ave(df$player_off, df$game_id, FUN = fun)
> dt.1
game_id team_id action_id player_off time_ref is_even
1 100 10 1 NA 1000 1
2 100 10 1 NA 1001 1
3 100 10 1 NA 1002 1
4 100 11 1 NA 1003 1
5 100 11 2 NA 1004 1
6 100 11 1 NA 1005 1
7 100 10 3 1 1006 1
8 100 11 1 NA 1007 0
9 100 10 1 NA 1008 0
10 100 10 1 NA 1009 0
11 101 12 3 0 1000 1
12 101 12 1 NA 1001 1
13 101 12 1 NA 1002 1
14 101 13 2 NA 1003 1
15 101 13 3 1 1004 1
16 101 12 1 NA 1005 0
17 101 13 1 NA 1006 0
18 101 13 1 NA 1007 0
19 101 12 1 1 1008 0
20 101 12 1 NA 1009 0
Notice how at line 19 and line 20, is_even=0, which is not what the OP wants.
My code does not process NAs, so I am going to transform NA to 0 first.
> dt.1<-as.data.table(dt.1)
> dt.1[is.na(dt.1)]<-0
My code produces the correct output: at times 1008 and 1009, where both team 12 and team 13 have had 1 player off, the two teams are even.
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id][order(game_id,time_ref)][,.(team_id,time_ref,action_id,player_off,even=as.numeric(cumsum(team2_off)==cumsum(team1_off))),by=game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 0 1
2: 100 10 1001 1 0 1
3: 100 10 1002 1 0 1
4: 100 11 1003 1 0 1
5: 100 11 1004 2 0 1
6: 100 11 1005 1 0 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 0 0
9: 100 10 1008 1 0 0
10: 100 10 1009 1 0 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 0 1
13: 101 12 1002 1 0 1
14: 101 13 1003 2 0 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 0 0
17: 101 13 1006 1 0 0
18: 101 13 1007 1 0 0
19: 101 12 1008 1 1 1
20: 101 12 1009 1 0 1
I understand this is a messy-looking chunk of data.table code, so let me explain it step by step.
dt[, .(
action_id,
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off,
team_id,
time_ref,
player_off
), by = game_id][order(game_id, time_ref)][, .(team_id,
time_ref,
action_id,
player_off,
even = cumsum(team2_off) == cumsum(team1_off)), by = game_id]
First, we take the data.table dt, group by game_id, and do this calculation:
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off
Grouping by both game_id and team_id at once does not help here, because the two teams' running totals have to be compared row by row within a game; a logical expression inside each game_id group handles this well. Multiplying the logical result of team_id == max/min(team_id) by player_off effectively gives team1_off and team2_off. When both factors are 1, the product is 1, which means that 1 player was sent off in the selected team.
Now we have a data table of:
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id]
game_id action_id team2_off team1_off team_id time_ref player_off
1: 100 1 0 0 10 1000 0
2: 100 1 0 0 10 1001 0
3: 100 1 0 0 10 1002 0
4: 100 1 0 0 11 1003 0
5: 100 2 0 0 11 1004 0
6: 100 1 0 0 11 1005 0
7: 100 3 0 1 10 1006 1
8: 100 1 0 0 11 1007 0
9: 100 1 0 0 10 1008 0
10: 100 1 0 0 10 1009 0
11: 101 3 0 0 12 1000 0
12: 101 1 0 0 12 1001 0
13: 101 1 0 0 12 1002 0
14: 101 2 0 0 13 1003 0
15: 101 3 1 0 13 1004 1
16: 101 1 0 0 12 1005 0
17: 101 1 0 0 13 1006 0
18: 101 1 0 0 13 1007 0
19: 101 1 0 1 12 1008 1
20: 101 1 0 0 12 1009 0
Now we no longer need to group by two variables (team_id, game_id); we can just take the cumsum by game_id and check whether cumsum(team1_off) == cumsum(team2_off). We also order by game_id and time_ref so that the result has the correct order.
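The logical multiplication can be seen in isolation on plain vectors. This is a base R sketch for a single game with made-up values, not the data.table code itself:

```r
team_id    <- c(12, 12, 13, 13, 12, 12)
player_off <- c(0, 0, 1, 0, 1, 0)      # NAs already replaced by 0

# TRUE/FALSE is treated as 1/0, so each product keeps only one team's send-offs
team1_off <- (team_id == min(team_id)) * player_off
team2_off <- (team_id == max(team_id)) * player_off

# even whenever both teams have lost the same number of players so far
even <- cumsum(team1_off) == cumsum(team2_off)
even
```

Here team 13 loses a player on row 3 and team 12 on row 5, so the teams are uneven on rows 3 and 4 and even again from row 5.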
I understand that NAs may have a different meaning than 0 in this scenario. If that matters to you, just create a dummy copy of player_off.
> dt$dummy<-dt$player_off
> dt$dummy[is.na(dt$dummy)]<-0
> dt<-as.data.table(dt)
> dt[, .(
+ action_id,
+ team2_off = (team_id == max(team_id)) * dummy,
+ team1_off = (team_id == min(team_id)) * dummy,
+ team_id,
+ time_ref,
+ player_off
+ ), by = game_id][order(game_id, time_ref)][, .(team_id,
+ time_ref,
+ action_id,
+ player_off,
+ even = as.numeric(cumsum(team2_off) == cumsum(team1_off))), by = game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 NA 1
2: 100 10 1001 1 NA 1
3: 100 10 1002 1 NA 1
4: 100 11 1003 1 NA 1
5: 100 11 1004 2 NA 1
6: 100 11 1005 1 NA 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 NA 0
9: 100 10 1008 1 NA 0
10: 100 10 1009 1 NA 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 NA 1
13: 101 12 1002 1 NA 1
14: 101 13 1003 2 NA 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 NA 0
17: 101 13 1006 1 NA 0
18: 101 13 1007 1 NA 0
19: 101 12 1008 1 NA 0
20: 101 12 1009 1 NA 0
I really think your question is very interesting, and I was determined to solve it using data.table. It took me a few hours, and I almost gave up on data.table, thinking that it couldn't handle two groupings at a time. I eventually solved it with a logical multiplication.
Great fun I had
team1_off = (team_id == min(team_id)) * dummy
team2_off = (team_id == max(team_id)) * dummy

Aggregation of all possible unique combinations with observations in the same column in R

I am trying to shorten a chunk of code to make it faster and easier to modify. This is a short example of my data.
order obs year var1 var2 var3
1 3 1 1 32 588 NA
2 4 1 2 33 689 2385
3 5 1 3 NA 678 2369
4 33 3 1 10 214 1274
5 34 3 2 10 237 1345
6 35 3 3 10 242 1393
7 78 6 1 5 62 NA
8 79 6 2 5 75 296
9 80 6 3 5 76 500
10 93 7 1 NA NA NA
11 94 7 2 4 86 247
12 95 7 3 3 54 207
Basically, what I want is for R to find every possible unique combination of two values (observations) in column "obs", within the same year, and to create a new matrix or data frame whose observations are the aggregation of the originals. Order is not important, so 1+6 = 6+1. For instance, with 150 observations, I would expect 11,175 feasible combinations (each year).
I sort of got what I want with basic coding but, as you will see, it is way too long (I have built 66 different new data sets this way, so it does not really make sense) and I am wondering how to shorten it. I did some trials (plyr, ...) with no real success. Here is what I did:
# For the 1st year, groups of 2 obs
newmatrix <- data.frame(t(combn(unique(data$obs[data$year==1]), 2)))
colnames(newmatrix) <- c("obs1", "obs2")
newmatrix$name <- do.call(paste, c(newmatrix[c("obs1", "obs2")], sep = "_"))
# and the aggregation of var. using indexes, which I will skip here to save your time :)
To illustrate, here is the result, for the sample above, of what I would get for the 1st year. The NAs are there because I only computed the sums where both values were valid, and only for variables 1 and 3. Also, I used the sum, but it could be any other function:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 42 NA
2 2 1 6 1_6 37 NA
3 3 1 7 1_7 NA NA
4 4 3 6 3_6 15 NA
5 5 3 7 3_7 NA NA
6 6 6 7 6_7 NA NA
As for the first 2 lines in the 3rd year, same type of matrix:
order obs1 obs2 year var1 var3
1 1 1 3 1_3 NA 3762
2 2 1 6 1_6 NA 2869
.......... etc ............
I hope I explained myself clearly. Thank you in advance for your hints on how to do this more efficiently.
I would use split-apply-combine to split by year, find all the combinations, and then combine back together:
do.call(rbind, lapply(split(data, data$year), function(x) {
p <- combn(nrow(x), 2)
data.frame(order=paste(x$order[p[1,]], x$order[p[2,]], sep="_"),
obs1=x$obs[p[1,]],
obs2=x$obs[p[2,]],
year=x$year[1],
var1=x$var1[p[1,]] + x$var1[p[2,]],
var2=x$var2[p[1,]] + x$var2[p[2,]],
var3=x$var3[p[1,]] + x$var3[p[2,]])
}))
# order obs1 obs2 year var1 var2 var3
# 1.1 3_33 1 3 1 42 802 NA
# 1.2 3_78 1 6 1 37 650 NA
# 1.3 3_93 1 7 1 NA NA NA
# 1.4 33_78 3 6 1 15 276 NA
# 1.5 33_93 3 7 1 NA NA NA
# 1.6 78_93 6 7 1 NA NA NA
# 2.1 4_34 1 3 2 43 926 3730
# 2.2 4_79 1 6 2 38 764 2681
# 2.3 4_94 1 7 2 37 775 2632
# 2.4 34_79 3 6 2 15 312 1641
# 2.5 34_94 3 7 2 14 323 1592
# 2.6 79_94 6 7 2 9 161 543
# 3.1 5_35 1 3 3 NA 920 3762
# 3.2 5_80 1 6 3 NA 754 2869
# 3.3 5_95 1 7 3 NA 732 2576
# 3.4 35_80 3 6 3 15 318 1893
# 3.5 35_95 3 7 3 13 296 1600
# 3.6 80_95 6 7 3 8 130 707
This enables you to be very flexible in how you combine data pairs of observations within a year --- x[p[1,],] represents the year-specific data for the first element in each pair and x[p[2,],] represents the year-specific data for the second element in each pair. You can return a year-specific data frame with any combination of data for the pairs, and the year-specific data frames are combined into a single final data frame with do.call and rbind.
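As a small illustration of that flexibility, the sketch below uses the year-1 values of obs and var1 from the question and combines each pair with both a sum and a maximum. The column names var1_sum and var1_max are made up for this example.

```r
x <- data.frame(obs = c(1, 3, 6, 7), var1 = c(32, 10, 5, NA))

p <- combn(nrow(x), 2)        # each column holds the row indices of one pair
first  <- x[p[1, ], ]         # data for the first element of each pair
second <- x[p[2, ], ]         # data for the second element of each pair

pairs <- data.frame(
  name     = paste(first$obs, second$obs, sep = "_"),
  var1_sum = first$var1 + second$var1,         # NA if either value is missing
  var1_max = pmax(first$var1, second$var1)     # any vectorised function works
)
pairs
```

With 4 observations this gives choose(4, 2) = 6 rows; with the 150 observations mentioned in the question, choose(150, 2) gives the expected 11,175 pairs per year.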
