How to renumber group id sequentially in R?

I am trying to renumber grouped unique ids sequentially using dplyr in R.
There are six columns in the data frame, as below.
## Load package if necessary
library(tidyverse)
## Set data frame
df <- data.frame(
hid=c(10001,10001,10001,10001,10002,10002,10002,10002,10002,
10003,10003,10003,10003,10003,10003,10004,10004,10004,10004,10004),
mid=c(1,2,3,4,1,2,3,4,5,1,2,3,4,5,6,1,2,3,4,5),
tmc=c(010,01010,0,01020,010,010,010,010,010,010,010,010,0,010,010,010,0,01010,010,01010),
thc=c(010,01010,0,02030,010,020,020,020,030,010,010,010,0,020,030,010,0,02020,030,04040),
mdc=c(000,01010,0,02020,000,010,010,010,010,000,000,010,0,010,020,000,0,02020,010,01010),
itc=c(010,01010,0,02020,020,020,020,020,020,010,010,010,0,020,020,010,0,02020,020,02020)
)
Unique ids are given to each row, grouped by four columns: tmc, thc, mdc and itc.
## Add unique id grouped by tmc, thc, mdc and itc
df.id <- df %>% mutate(id=as.numeric(interaction(tmc,thc,mdc,itc)))
As it does not give sequential ids, I need to renumber them.
However, I could not find a solution for that. The conditions are:
If tmc, thc, mdc and itc are all 0, id is set as 0 (I do not know the reason, but interaction gives 1 for such records in my data frame)
Other ids should be renumbered sequentially but keep their grouping (if ids are set as 4,8,2,2,8, they should be renamed as 1,2,3,3,2)
The following script shows what I am doing currently. id is the temporary id obtained from the interaction function, but I need the sequential id shown in the id.desired column.
## Replace unique id sequentially
## IT DOES NOT GIVE DESIRED OUTPUT
# df.id %>% group_by(id) %>% mutate(id2=seq_along(id))
## Desired id is shown in `id.desired`
## `id` is the ones obtained from `interaction` function, which are not set sequentially
hid mid tmc thc mdc itc id id.desired
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
1 10001 1 10 10 0 10 166 1
2 10001 2 1010 1010 1010 1010 595 2
3 10001 3 0 0 0 0 1 0
4 10001 4 1020 2030 2020 2020 796 3
5 10002 1 10 10 0 20 326 4
6 10002 2 10 20 10 20 362 5
7 10002 3 10 20 10 20 362 5
8 10002 4 10 20 10 20 362 5
9 10002 5 10 30 10 20 366 6
10 10003 1 10 10 0 10 166 1
11 10003 2 10 10 0 10 166 1
12 10003 3 10 10 10 10 198 7
13 10003 4 0 0 0 0 1 0
14 10003 5 10 20 10 20 362 5
15 10003 6 10 30 20 20 398 8
16 10004 1 10 10 0 10 166 1
17 10004 2 0 0 0 0 1 0
18 10004 3 1010 2020 2020 2020 791 9
19 10004 4 10 30 10 20 366 6
20 10004 5 1010 4040 1010 2020 767 10
Any suggestions?
I prefer to use dplyr for this operation.
I received some suggestions on a previous question, but the structure is not the same in this case (the dummy field does not exist in the current data frame):
How to renumber result of intersection/group_indices in R?

A solution using the tidyverse. Notice that I did not use the interaction function. Instead, I used the group_indices function from dplyr to create the group index, then converted it to a factor whose levels follow the order of first occurrence. df2 is the final output.
library(tidyverse)
df2 <- df %>%
  ## Drop the rows where tmc, thc, mdc and itc are all 0
  filter_at(vars(tmc, thc, mdc, itc), any_vars(. != 0)) %>%
  mutate(id = group_indices(., tmc, thc, mdc, itc)) %>%
  ## Renumber the ids in order of first occurrence
  mutate(id = as.numeric(factor(id, levels = unique(id)))) %>%
  ## Join back onto the full data; the all-zero rows get NA ids...
  left_join(df, ., by = names(df)) %>%
  ## ...which are then replaced with 0
  replace_na(list(id = 0))
df2
# hid mid tmc thc mdc itc id
# 1 10001 1 10 10 0 10 1
# 2 10001 2 1010 1010 1010 1010 2
# 3 10001 3 0 0 0 0 0
# 4 10001 4 1020 2030 2020 2020 3
# 5 10002 1 10 10 0 20 4
# 6 10002 2 10 20 10 20 5
# 7 10002 3 10 20 10 20 5
# 8 10002 4 10 20 10 20 5
# 9 10002 5 10 30 10 20 6
# 10 10003 1 10 10 0 10 1
# 11 10003 2 10 10 0 10 1
# 12 10003 3 10 10 10 10 7
# 13 10003 4 0 0 0 0 0
# 14 10003 5 10 20 10 20 5
# 15 10003 6 10 30 20 20 8
# 16 10004 1 10 10 0 10 1
# 17 10004 2 0 0 0 0 0
# 18 10004 3 1010 2020 2020 2020 9
# 19 10004 4 10 30 10 20 6
# 20 10004 5 1010 4040 1010 2020 10
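For reference, filter_at()/any_vars() and group_indices() have since been superseded in dplyr. A hedged sketch of the same idea for dplyr >= 1.0 (assuming if_any() and cur_group_id() are available) might look like:
library(tidyverse)
df2b <- df %>%
  filter(if_any(c(tmc, thc, mdc, itc), ~ . != 0)) %>%            # drop all-zero rows
  group_by(tmc, thc, mdc, itc) %>%
  mutate(id = cur_group_id()) %>%                                 # group index (sorted order)
  ungroup() %>%
  mutate(id = as.numeric(factor(id, levels = unique(id)))) %>%    # reorder by first appearance
  left_join(df, ., by = names(df)) %>%                            # bring the all-zero rows back
  replace_na(list(id = 0))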

Not sure how to interpret the id.desired column but here is an example based on the two conditions and using data.table:
require(data.table)
df = data.table(df)
## Exclude only the rows where all four columns are 0; .GRP numbers the
## groups sequentially in order of first appearance
df[!(tmc == 0 & thc == 0 & mdc == 0 & itc == 0), ID := .GRP, by = .(tmc, thc, mdc, itc)]
df[is.na(ID), ID := 0]

(edited mutate based on your clarification in comments)
Here are the two things I tried to do:
To ensure that id = 0 when certain variables are 0, I used if_else in the mutate function with the specific conditions you specified.
To get id.desired, I used the dense_rank() function.
Here is the code based on the dataset you shared:
df %>%
mutate(id = if_else(tmc == 0 & thc == 0 & mdc == 0 & itc == 0, 0,
as.numeric(interaction(tmc, thc, mdc, itc, lex.order = TRUE)))) %>%
mutate(id.desired = dense_rank(id) - 1)
The output looks like this
hid mid tmc thc mdc itc id id.desired
1 10001 1 10 10 0 10 227 1
2 10001 2 1010 1010 1010 1010 519 7
3 10001 3 0 0 0 0 0 0
4 10001 4 1020 2030 2020 2020 775 10
5 10002 1 10 10 0 20 228 2
6 10002 2 10 20 10 20 258 4
7 10002 3 10 20 10 20 258 4
8 10002 4 10 20 10 20 258 4
9 10002 5 10 30 10 20 283 5
10 10003 1 10 10 0 10 227 1
11 10003 2 10 10 0 10 227 1
12 10003 3 10 10 10 10 232 3
13 10003 4 0 0 0 0 0 0
14 10003 5 10 20 10 20 258 4
15 10003 6 10 30 20 20 288 6
16 10004 1 10 10 0 10 227 1
17 10004 2 0 0 0 0 0 0
18 10004 3 1010 2020 2020 2020 550 8
19 10004 4 10 30 10 20 283 5
20 10004 5 1010 4040 1010 2020 595 9
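Note that dense_rank() numbers groups by the sorted id value rather than by first appearance, which is why id.desired here differs from the one in the question. If first-appearance order is needed, one possible tweak (my assumption, not part of the original answer) is to use match() against the unique non-zero ids:
df %>%
  mutate(id = if_else(tmc == 0 & thc == 0 & mdc == 0 & itc == 0, 0,
                      as.numeric(interaction(tmc, thc, mdc, itc, lex.order = TRUE)))) %>%
  ## match() gives each id its position among the unique non-zero ids,
  ## i.e. a sequential id in order of first appearance
  mutate(id.desired = if_else(id == 0, 0, as.numeric(match(id, unique(id[id != 0])))))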

Related

How to do the following row operation

Below is the sample data. The task at hand is to sum quarter1 and quarter2 for ownership code 30, but exclude indcode 115. From there, append a new row that contains this sum. In Excel this is very simple, but I am hoping to automate this a bit using R. The bottom half of the desired result is below. First question: would I pivot_wider so that I am summing columns, not rows?
area <- c(000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000,000000)
indcode <- c(110,111,112,113,114,115,110,111,112,113,114,115)
quarter1 <- c(NA,2,4,6,16,3,NA,1,2,3,8,2)
quarter2 <- c(2,3,5,7,22,1,9,1,2,4,11,1)
ownership <- c(00,00,00,00,00,00,30,30,30,30,30,30)
employment <- data.frame(area,indcode,quarter1,quarter2,ownership)
area indcode quarter1 quarter2 ownership
000000 111 1 1 30
000000 112 2 2 30
000000 113 3 4 30
000000 114 8 11 30
000000 115 2 1 30
000000 993 14 18 30
I've assumed you want this done for area groups, but if not you can delete the group_by(area) line.
employment %>%
group_by(area) %>%
summarize(
across(quarter1:quarter2, ~sum(.x[ownership == 30 & indcode != 115], na.rm = TRUE)),
indcode = 993,
ownership = 30
) %>%
bind_rows(employment, .)
# area indcode quarter1 quarter2 ownership
# 1 0 111 2 3 0
# 2 0 112 4 5 0
# 3 0 113 6 7 0
# 4 0 114 16 22 0
# 5 0 115 3 1 0
# 6 0 111 1 1 30
# 7 0 112 2 2 30
# 8 0 113 3 4 30
# 9 0 114 8 11 30
# 10 0 115 2 1 30
# 11 0 993 14 18 30

Efficient way in R to add a new column to a dataframe with huge dataset

I really need to speed some R code up. I have a large dataset from a particular sport. Each row in the data frame represents some type of action in the game. For each game (game_id) we have two teams (team_id) that take part in the game. time_ref orders the actions chronologically within each game. action_id is the type of action in the game. player_off is linked to action_id=3: action_id=3 represents a player getting a card, and player_off is set to TRUE/FALSE depending on whether the player was sent off when they got that card. Example data.frame:
> df
game_id team_id action_id player_off time_ref
100 10 1 NA 1000
100 10 1 NA 1001
100 10 1 NA 1002
100 11 1 NA 1003
100 11 2 NA 1004
100 11 1 NA 1005
100 10 3 1 1006
100 11 1 NA 1007
100 10 1 NA 1008
100 10 1 NA 1009
101 12 3 0 1000
101 12 1 NA 1001
101 12 1 NA 1002
101 13 2 NA 1003
101 13 3 1 1004
101 12 1 NA 1005
101 13 1 NA 1006
101 13 1 NA 1007
101 12 1 NA 1008
101 12 1 NA 1009
What I need is another column in the data frame that gives me TRUE or FALSE on whether both teams had an equal/unequal number of players on the field while each action (row) took place.
So game_id=100 had an action_id=3 & player_off=1 for team_id=10 at time_ref=1006. So we know the teams had an equal number of players on the field up to that point, but were unequal for the rest of the game (time_ref > 1006). The same thing occurred in game_id=101.
This an example of the data frame with an extra column I would like to have for the dataset.
>df
game_id team_id action_id player_off time_ref is_even
100 10 1 NA 1000 1
100 10 1 NA 1001 1
100 10 1 NA 1002 1
100 11 1 NA 1003 1
100 11 2 NA 1004 1
100 11 1 NA 1005 1
100 10 3 1 1006 1
100 11 1 NA 1007 0
100 10 1 NA 1008 0
100 10 1 NA 1009 0
101 12 3 0 1000 1
101 12 1 NA 1001 1
101 12 1 NA 1002 1
101 13 2 NA 1003 1
101 13 3 1 1004 1
101 12 1 NA 1005 0
101 13 1 NA 1006 0
101 13 1 NA 1007 0
101 12 1 NA 1008 0
101 12 1 NA 1009 0
So you can see that in game_id=100 a player was sent off at time_ref=1006 so all previous rows were marked as is_even=1 and subsequent marked as uneven or 0. Similar for game_id=101 at time_ref=1004.
What is the most efficient way of achieving this extra column? Preferably not using for loops.
For some vector
x = c(0, NA, NA, NA, 1, NA, NA, NA)
write a function to standardize the data (0 or 1 player lost), calculate the cumulative number of players lost, and compare this to zero,
fun0 = function(x) {
x[is.na(x)] = 0
cumsum(x) == 0
}
For several groups, use ave() with a grouping variable
x = c(x, rev(x))
grp = rep(1:2, each = length(x) / 2)
ave(x, grp, FUN = fun0)
For the data in the question, try
df$is_even = ave(df$player_off, df$game_id, FUN = fun0)
Semantically, it seems likely that the required logic is more complicated than fun0() implies: specifically, if each team loses a player, they are again even, as #SunLisa says. If so, clean the data
df$player_off[is.na(df$player_off)] = 0
and change fun0(), e.g.,
fun1 <- function(x, team) {
is_team_1 <- team == head(team, 1) # is 'team' the first team?
x1 <- x & is_team_1 # lost player & team 1
x2 <- x & !is_team_1 # lost player & team 2
cumsum(x1) == cumsum(x2) # same total number of players?
}
(it doesn't seem like a good idea to coerce the logical return value to an integer). This could be applied by group with
df$is_even = ave(seq_len(nrow(df)), df$game_id, FUN = function(i) {
fun1(df$player_off[i], df$team_id[i])
})
or
split(df$is_even, df$game_id) <-
Map(fun1,
split(df$player_off, df$game_id),
split(df$team_id, df$game_id)
)
The implementation of ave() is useful to look at, the important line being
split(x, g) <- lapply(split(x, g), FUN)
The right-hand side splits x by group g, then applies FUN() to each group. The left-hand side split<-() is a tricky operation, using the group indexes to update the original vector x.
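As a small toy illustration of the split<-() idiom (my own example, not from the original answer), doubling each group's values in place:
x <- c(1, 2, 3, 4)
g <- c(1, 2, 1, 2)
split(x, g) <- lapply(split(x, g), function(v) v * 2)
x
## [1] 2 4 6 8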
Comments
The original question asked for 'no for loops', but actually lapply() (in ave()) and Map() are exactly that; ave() is relatively efficient because of the split-apply-combine strategy it adopts, rather than what the OP probably implemented, which was likely to iterate through games, subset the data frame, then update the data.frame for each game. The subsetting would have duplicated subsets of the entire data set, and the update in particular would have copied at least the entire result column on each assignment; this copying would have slowed the execution down a lot. It's also possible that the OP was struggling with fun0(); it would help to clarify the question, especially the title, to identify that as the problem.
There are faster ways, especially using the data.table package, but the principle is the same -- identify a function that operates on a vector the way you'd like, and apply it by group.
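For example, a hedged data.table sketch of the same apply-by-group idea (assuming the NAs in player_off have already been replaced with 0, as above) could be:
library(data.table)
## Apply fun1() within each game; same logic as the ave()/Map() versions
setDT(df)[, is_even := fun1(player_off, team_id), by = game_id]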
An alternative, fully-vectorized, solution follows this suggestion to calculate a cumulative sum by group. For fun0(), standardize x to be the number of players leaving the game at a particular timepoint, without NAs
x[is.na(x)] = 0
For the equivalent of fun0(), calculate the cumulative sum of players leaving the game, irrespective of group
cs = cumsum(x)
Correct this for the group that the cumulative sum applies to
in_game = cs - (grp - 1)
and set this to 'TRUE' when 0 players have left the game
is_even = (in_game == 0)
This relies on grp indexing from 1 to the number of groups; for the data here, one might use grp = match(df$game_id, unique(df$game_id)). A similar solution exists for fun1().
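Putting those fragments together, a runnable sketch of the vectorized version (my assembly of the steps above; the grp - 1 correction assumes, as in the example data, that exactly one player leaves in each earlier game):
x <- df$player_off
x[is.na(x)] <- 0                               # standardize: 0 or 1 player lost
grp <- match(df$game_id, unique(df$game_id))   # group index, 1..number of games
cs <- cumsum(x)                                # players lost, ignoring group
in_game <- cs - (grp - 1)                      # correct for earlier groups
df$is_even <- in_game == 0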
Here's a dplyr + tidyr solution to the problem, with a summary of what was done:
1. Manipulate the data by converting all NAs in player_off to 0 for easier summing, and assign team_num "team1" to the smaller team_id (assuming there are only 2) and "team2" to the other.
2. "Tally" the player_offs using spread, filling the invalid combinations in the data with 0 -- for example, in game_id = 100, there's no team_id = 11 for time_ref = 1000.
3. Take the cumulative sum of the lagged team1 and team2 vectors (and of course fill NAs with 0).
Code below:
require(dplyr)
require(tidyr)
df %>%
group_by(game_id) %>%
mutate(
player_off = player_off %>% replace(list = is.na(.), values = 0),
team_num = if_else(team_id == min(team_id), "team1", "team2")
) %>%
spread(key = team_num, value = player_off, fill = 0) %>%
arrange(game_id, time_ref) %>%
mutate(
team1_cum = cumsum(lag(team1, default = 0)),
team2_cum = cumsum(lag(team2, default = 0)),
is_even = as.integer(team1_cum == team2_cum)
) %>%
ungroup() %>%
select(-team1, -team2, -team1_cum, -team2_cum)
Output:
# A tibble: 20 x 5
game_id team_id action_id time_ref is_even
<int> <int> <int> <int> <int>
1 100 10 1 1000 1
2 100 10 1 1001 1
3 100 10 1 1002 1
4 100 11 1 1003 1
5 100 11 2 1004 1
6 100 11 1 1005 1
7 100 10 3 1006 1
8 100 11 1 1007 0
9 100 10 1 1008 0
10 100 10 1 1009 0
11 101 12 3 1000 1
12 101 12 1 1001 1
13 101 12 1 1002 1
14 101 13 2 1003 1
15 101 13 3 1004 1
16 101 12 1 1005 0
17 101 13 1 1006 0
18 101 13 1 1007 0
19 101 12 1 1008 0
20 101 12 1 1009 0
Here's my thinking:
data.table is going to work well, especially when you are working with large data sets, as it's faster. We just need to group the data, take the cumulative sum of each team's sent-off players, and see whether the two are equal.
First I have to say:
(problem solved by Martin Morgan; his updated answer no longer has this error)
I don't think #Martin Morgan 's answer is right. Let's imagine a certain case: when team 1 has had one player off, after which team 2 has had another player off, the two teams should be even again, but #Martin Morgan's output would be FALSE.
I'll make an example with this dataset, where player_off of record 19 was modified to 1. This means that in game 101, after team 13 had 1 player off at 1004, team 12 had 1 player off at 1008, which would make the two teams even at 1009.
> dt.1
game_id team_id action_id player_off time_ref
1 100 10 1 NA 1000
2 100 10 1 NA 1001
3 100 10 1 NA 1002
4 100 11 1 NA 1003
5 100 11 2 NA 1004
6 100 11 1 NA 1005
7 100 10 3 1 1006
8 100 11 1 NA 1007
9 100 10 1 NA 1008
10 100 10 1 NA 1009
11 101 12 3 0 1000
12 101 12 1 NA 1001
13 101 12 1 NA 1002
14 101 13 2 NA 1003
15 101 13 3 1 1004
16 101 12 1 NA 1005
17 101 13 1 NA 1006
18 101 13 1 NA 1007
19 101 12 1 1 1008
20 101 12 1 NA 1009
But #Martin Morgan 's function would produce this output:
> dt.1$is_even = ave(dt.1$player_off, dt.1$game_id, FUN = fun0)
> dt.1
game_id team_id action_id player_off time_ref is_even
1 100 10 1 NA 1000 1
2 100 10 1 NA 1001 1
3 100 10 1 NA 1002 1
4 100 11 1 NA 1003 1
5 100 11 2 NA 1004 1
6 100 11 1 NA 1005 1
7 100 10 3 1 1006 1
8 100 11 1 NA 1007 0
9 100 10 1 NA 1008 0
10 100 10 1 NA 1009 0
11 101 12 3 0 1000 1
12 101 12 1 NA 1001 1
13 101 12 1 NA 1002 1
14 101 13 2 NA 1003 1
15 101 13 3 1 1004 1
16 101 12 1 NA 1005 0
17 101 13 1 NA 1006 0
18 101 13 1 NA 1007 0
19 101 12 1 1 1008 0
20 101 12 1 NA 1009 0
Notice how at rows 19 and 20, is_even = 0, which is not what the OP wants.
My code does not handle NAs, so I am going to convert NA to 0 first.
> dt.1<-as.data.table(dt.1)
> dt.1[is.na(dt.1)]<-0
My code produces the correct output: at times 1008 and 1009, where both team 12 and team 13 have had 1 player off, the two teams are even.
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id][order(game_id,time_ref)][,.(team_id,time_ref,action_id,player_off,even=as.numeric(cumsum(team2_off)==cumsum(team1_off))),by=game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 0 1
2: 100 10 1001 1 0 1
3: 100 10 1002 1 0 1
4: 100 11 1003 1 0 1
5: 100 11 1004 2 0 1
6: 100 11 1005 1 0 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 0 0
9: 100 10 1008 1 0 0
10: 100 10 1009 1 0 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 0 1
13: 101 12 1002 1 0 1
14: 101 13 1003 2 0 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 0 0
17: 101 13 1006 1 0 0
18: 101 13 1007 1 0 0
19: 101 12 1008 1 1 1
20: 101 12 1009 1 0 1
I understand it is a messy looking chunk of data.table code; let me explain it step by step.
dt[, .(
action_id,
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off,
team_id,
time_ref,
player_off
), by = game_id][order(game_id, time_ref)][, .(team_id,
time_ref,
action_id,
player_off,
even = cumsum(team2_off) == cumsum(team1_off)), by = game_id]
first, we take the data.table dt, group by game_id, and do this calculation:
team2_off = (team_id == max(team_id)) * player_off,
team1_off = (team_id == min(team_id)) * player_off
data.table has some trouble taking two groupings at once (by game_id and team_id), but it handles logical expressions inside each group well. In this way, we effectively get team1_off and team2_off by multiplying the logical output of team_id == max/min(team_id) with player_off. When both are 1, the output is 1, which means 1 player was off in the selected team.
Now we have a data table of:
> dt.1[,.(action_id,team2_off=(team_id==max(team_id))*player_off,team1_off=(team_id==min(team_id))*player_off,team_id,time_ref,player_off),by=game_id]
game_id action_id team2_off team1_off team_id time_ref player_off
1: 100 1 0 0 10 1000 0
2: 100 1 0 0 10 1001 0
3: 100 1 0 0 10 1002 0
4: 100 1 0 0 11 1003 0
5: 100 2 0 0 11 1004 0
6: 100 1 0 0 11 1005 0
7: 100 3 0 1 10 1006 1
8: 100 1 0 0 11 1007 0
9: 100 1 0 0 10 1008 0
10: 100 1 0 0 10 1009 0
11: 101 3 0 0 12 1000 0
12: 101 1 0 0 12 1001 0
13: 101 1 0 0 12 1002 0
14: 101 2 0 0 13 1003 0
15: 101 3 1 0 13 1004 1
16: 101 1 0 0 12 1005 0
17: 101 1 0 0 13 1006 0
18: 101 1 0 0 13 1007 0
19: 101 1 0 1 12 1008 1
20: 101 1 0 0 12 1009 0
Now we no longer need to group by two variables (team_id, game_id); we can just take the cumsum by game_id and compare whether cumsum(team1_off) == cumsum(team2_off), ordering by game_id and time_ref so the result comes out in the correct order.
I understand that NAs may have different meanings than 0 in this scenario. If you really care about that, just create a dummy copy of player_off.
> dt$dummy<-dt$player_off
> dt$dummy[is.na(dt$dummy)]<-0
> dt<-as.data.table(dt)
> dt[, .(
+ action_id,
+ team2_off = (team_id == max(team_id)) * dummy,
+ team1_off = (team_id == min(team_id)) * dummy,
+ team_id,
+ time_ref,
+ player_off
+ ), by = game_id][order(game_id, time_ref)][, .(team_id,
+ time_ref,
+ action_id,
+ player_off,
+ even = as.numeric(cumsum(team2_off) == cumsum(team1_off))), by = game_id]
game_id team_id time_ref action_id player_off even
1: 100 10 1000 1 NA 1
2: 100 10 1001 1 NA 1
3: 100 10 1002 1 NA 1
4: 100 11 1003 1 NA 1
5: 100 11 1004 2 NA 1
6: 100 11 1005 1 NA 1
7: 100 10 1006 3 1 0
8: 100 11 1007 1 NA 0
9: 100 10 1008 1 NA 0
10: 100 10 1009 1 NA 0
11: 101 12 1000 3 0 1
12: 101 12 1001 1 NA 1
13: 101 12 1002 1 NA 1
14: 101 13 1003 2 NA 1
15: 101 13 1004 3 1 0
16: 101 12 1005 1 NA 0
17: 101 13 1006 1 NA 0
18: 101 13 1007 1 NA 0
19: 101 12 1008 1 NA 0
20: 101 12 1009 1 NA 0
I really think your question is very interesting, and I was determined to solve this using data.table. It took me a few hours and I almost gave up on data.table, thinking that it just couldn't process two groupings at a time. I eventually solved it with a logical multiplication.
Great fun I had:
team1_off = (team_id == min(team_id)) * dummy
team2_off = (team_id == max(team_id)) * dummy

Dividing proportionally row values based on common identifier and specific column in a data frame

After a merging process, I got a data frame that looks like:
df <- data.frame(trip=c(315,328,422,422,458,652,652,652,699),
catch_kg=c(10,8,12,2,26,4,18,14,11),
age_1=c(0,0,0,0,0,0,0,0,0),
age_2=c(2,1,7.5,7.5,8,11,11,11,13),
id=c(1,2,3,3,4,5,5,5,6))
trip catch_kg age_1 age_2 id
315 10 0 2 1
328 8 0 1 2
422 12 0 7.5 3
422 2 0 7.5 3
458 26 0 8 4
652 4 0 11 5
652 18 0 11 5
652 14 0 11 5
699 11 0 13 6
where trip represents the fishing trip, catch_kg the amount of fish caught (in kg), age_1 & age_2 the number of individuals per age group in each trip, and id the haul identity within each trip.
In some fishing trips I have more than 1 haul - this can be seen in the id column, where hauls from the same trip share an id number. For instance, trip number 422 has two hauls (id=3).
At the moment, for a trip with more than 1 haul, the number of individuals within each age group is divided equally among the hauls of that trip. For example, trip 422 has a total of 15 individuals, but since there are 2 hauls, this number was divided by 2, giving 7.5 individuals per haul.
What I would like, however, is to compute the number of individuals within each age group as a proportion of the total catch in each haul group.
Thus, at the end I would like to have a data frame that looks like:
trip catch_kg age_1 age_2 id
315 10 0 2 1
328 8 0 1 2
422 12 0 13 3
422 2 0 2 3
458 26 0 8 4
652 4 0 4 5
652 18 0 16 5
652 14 0 13 5
699 11 0 13 6
This is basically a rule of three calculation, where for trip 422 (2 hauls), for instance, I would have the following calculation:
haul1: 12*(7.5 + 7.5)/(12 + 2) = 13 individuals
haul2: 2*(7.5 + 7.5)/(12 + 2) = 2 individuals
Is there an easy way to compute these calculations?
Any help would be much appreciated.
-M
You could use dplyr to help with this
library(dplyr)
df %>% group_by(trip) %>%
mutate(age_2=catch_kg/sum(catch_kg)*sum(age_2))
# trip catch_kg age_1 age_2 id
# <dbl> <dbl> <dbl> <dbl> <dbl>
# 1 315 10 0 2.000000 1
# 2 328 8 0 1.000000 2
# 3 422 12 0 12.857143 3
# 4 422 2 0 2.142857 3
# 5 458 26 0 8.000000 4
# 6 652 4 0 3.666667 5
# 7 652 18 0 16.500000 5
# 8 652 14 0 12.833333 5
# 9 699 11 0 13.000000 6
Not sure exactly what rounding rule you were using to get integer counts of individuals, but you'd likely run into trouble with parts not adding up to wholes in more complicated scenarios.
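If whole numbers that still sum to each trip's total are required, one hedged option (my suggestion, not part of this answer) is largest-remainder rounding, which here reproduces the whole-number table in the question:
library(dplyr)
## Round everything down, then hand the leftover units to the
## entries with the largest fractional parts
round_preserve_sum <- function(x) {
  f <- floor(x)
  k <- round(sum(x)) - sum(f)   # units left to distribute
  f + (rank(-(x - f), ties.method = "first") <= k)
}
df %>%
  group_by(trip) %>%
  mutate(age_2 = round_preserve_sum(catch_kg / sum(catch_kg) * sum(age_2)))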
Another solution using data.table:
library(data.table)
setDT(df)
df[, age_2 := catch_kg * sum(age_2) / sum(catch_kg), trip]
# trip catch_kg age_1 age_2 id
#1: 315 10 0 2.000000 1
#2: 328 8 0 1.000000 2
#3: 422 12 0 12.857143 3
#4: 422 2 0 2.142857 3
#5: 458 26 0 8.000000 4
#6: 652 4 0 3.666667 5
#7: 652 18 0 16.500000 5
#8: 652 14 0 12.833333 5
#9: 699 11 0 13.000000 6
If you want you can round age_2 with round(): age_2 := round(catch_kg * sum(age_2) / sum(catch_kg))

How to remove rows based on distance from an average of column and max of another column

Consider this toy data frame. I would like to create a new data frame keeping only rows where "birds" is below the average of "birds", and where "wolfs" is below the third-highest value of "wolfs" (i.e. excluding the top two values). So from this data frame I should get only rows 543, 608, 987, 225, 988 and 556.
I used these two lines of code for the first constraint but couldn't find a solution for the second constraint.
df$filt <- ifelse(df$birds < mean(df$birds), 1, 0)
df1 <- df[which(df$filt == 1), ]
How can I implement the second constraint?
Here is the toy dataframe:
df <- read.table(text = "userid target birds wolfs
222 1 9 7
444 1 8 4
234 0 2 8
543 1 2 3
678 1 8 3
987 0 1 2
294 1 7 1
608 0 1 5
123 1 9 7
321 1 8 7
226 0 2 7
556 0 2 3
334 1 6 3
225 0 1 1
999 0 3 9
988 0 1 1 ",header = TRUE)
subset(df,birds < mean(birds) & wolfs < sort(unique(wolfs),decreasing=T)[3]);
## userid target birds wolfs
## 4 543 1 2 3
## 6 987 0 1 2
## 8 608 0 1 5
## 12 556 0 2 3
## 14 225 0 1 1
## 16 988 0 1 1
Here is a solution, but maybe some constraints are not clear to me, because it fits a different set of rows than your desired output.
avbi <- mean(df$birds)
ttw <- sort(df$wolfs, decreasing = T)[3]
df[df$birds < avbi & df$wolfs < ttw , ]
userid target birds wolfs
4 543 1 2 3
6 987 0 1 2
8 608 0 1 5
12 556 0 2 3
14 225 0 1 1
16 988 0 1 1
or with dplyr
df %>% filter(birds < avbi & wolfs < ttw)

compute a Means variable for a specific value in another variable

I would like to compute the mean age for every value from 1-7 in another variable called period.
This is how my data looks like:
work1 <- read.table(header=T, text="ID dead age gender inclusion_year diagnosis surv agrp period
87 0 25 2 2006 1 2174 1 5
396 0 19 2 2003 1 3077 1 3
446 0 23 2 2003 1 3144 1 3
497 0 19 2 2011 1 268 1 7
522 1 57 2 1999 1 3407 2 1
714 0 58 2 2003 1 3041 2 3
741 0 27 2 2004 1 2587 1 4
767 0 18 1 2008 1 1104 1 6
786 0 36 1 2005 1 2887 3 4
810 0 25 1 1998 1 3783 4 2")
This is a subset of a data set with more than 1,500 observations.
This is what I'm trying to achieve:
sim <- read.table(header=T, text="Period diagnosis dead surv age
1 1 50 50000 35.5
2 1 80 70000 40.3
3 1 100 80000 32.8
4 1 120 100000 39.8
5 1 140 1200000 28.7
6 1 150 1400000 36.2
7 1 160 1600000 37.1")
In this data set I would like to group by period and diagnosis, with all deaths (dead) and surv (survival time in days) summed within each period. I would also like the mean age in every period.
I have tried everything and still can't create the data set I'm striving for.
All help is appreciated!
You could try data.table
library(data.table)
as.data.table(work1)[, .(dead_sum=sum(dead),
surv_sum=sum(surv),
age_mean=mean(age)), keyby=.(period, diagnosis)]
Or dplyr
library(dplyr)
work1 %>% group_by(period, diagnosis) %>%
summarise(dead_sum=sum(dead), surv_sum=sum(surv), age_mean=mean(age))
# result
period diagnosis dead_sum surv_sum age_mean
1: 1 1 1 3407 57.00000
2: 2 1 0 3783 25.00000
3: 3 1 0 9262 33.33333
4: 4 1 0 5474 31.50000
5: 5 1 0 2174 25.00000
6: 6 1 0 1104 18.00000
7: 7 1 0 268 19.00000
