sum column by ID [duplicate] - r

I have a data frame like this one, with several rows for the same subject and different durations:
ID Duration
1 10
2 15
2 10
3 2
3 5
3 6
I would like to sum up all the durations per subject and put it in a new column to get a data frame like this one:
ID Duration Sum_duration
1 10 10
2 15 25
2 10 25
3 2 13
3 5 13
3 6 13
I thought of using these functions:
df %>% group_by(ID) and colSums
but I don't know how to use them in my case.
Thanks in advance for your help

Using dplyr
library(dplyr)
df %>% group_by(ID) %>% mutate(Sum_duration = sum(Duration))
ID Duration Sum_duration
<dbl> <dbl> <dbl>
1 1 10 10
2 2 15 25
3 2 10 25
4 3 2 13
5 3 5 13
6 3 6 13
Data
df = data.frame('ID' = c(1,2,2,3,3,3), "Duration" = c(10,15,10,2,5,6))

Using ave
df$Sum_duration <- ave(df$Duration, df$ID, FUN = sum)
ID Duration Sum_duration
1 1 10 10
2 2 15 25
3 2 10 25
4 3 2 13
5 3 5 13
6 3 6 13
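As a rough illustration of what ave is doing in the answer above, the sketch below computes one total per ID with tapply and then looks each row's ID up in that table. The column name Sum_lookup is introduced here only for comparison and is not part of the original answers.

```r
df <- data.frame(ID = c(1, 2, 2, 3, 3, 3), Duration = c(10, 15, 10, 2, 5, 6))

# ave() applies FUN within each ID group and returns a vector as long as the input
df$Sum_duration <- ave(df$Duration, df$ID, FUN = sum)

# Same idea in two steps: one total per ID, then a lookup by ID for every row
totals <- tapply(df$Duration, df$ID, sum)
df$Sum_lookup <- as.vector(totals[as.character(df$ID)])  # hypothetical comparison column

df
```

Both columns should hold the same values; ave simply does the grouping and the broadcasting back to the rows in one call.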

Related

Adding lines in data frame for each observation

I have data in long format, meaning that each individual has more than one observation (and each observation has one row). Each individual has a different number of observations. I would like to restructure my data so that each individual has the same number of observations. To do that, I would like to find the individual with the most observations and add rows to the others using LOCF (as many rows as each individual is missing).
For example:
# simulate data structure
d <- data.frame(
id = c(1,1,1,2,2,3,3,3,3,3),
value = c(10,11,12,5,9,55,14,12,20,7) )
Now individual 3 has the most observations (count = 5). I would like to add two lines for individual 1 (with 12 for value) and three lines for individual 2 (with 9 for value)
Any ideas?
Best wishes and thank you.
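In case you wish to carry forward the last value for each individual, you could do
# number the observations within each id
d$seq <- ave(d$id, d$id, FUN = seq_along)
d <- merge(
d,
# cross join (by = NULL): last value per id x every position up to the largest group size
merge(
aggregate(value ~ id, data = d, FUN = tail, 1),
data.frame(seq = 1:max(table(d$id))),
by = NULL
),
by = c("id", "seq"),
all.y = TRUE
)
# keep the observed value where present, otherwise the carried-forward one
d$value <- ifelse(is.na(d$value.x), d$value.y, d$value.x)
d <- d[, !grepl("^value\\.", colnames(d))]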
id seq value
1 1 1 10
2 1 2 11
3 1 3 12
4 1 4 12
5 1 5 12
6 2 1 5
7 2 2 9
8 2 3 9
9 2 4 9
10 2 5 9
11 3 1 55
12 3 2 14
13 3 3 12
14 3 4 20
15 3 5 7
Here's a tidyverse solution. If we create a variable holding the within-id count using seq_along, then we can use complete and fill to expand the table and fill in the missing values.
library(dplyr)
library(tidyr)
d |> group_by(id) |>
mutate(n = seq_along(value)) |>
ungroup() |>
complete(id, n) |>
fill(value) |>
select(-n)
# A tibble: 15 × 2
id value
<dbl> <dbl>
1 1 10
2 1 11
3 1 12
4 1 12
5 1 12
6 2 5
7 2 9
8 2 9
9 2 9
10 2 9
11 3 55
12 3 14
13 3 12
14 3 20
15 3 7
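The padding step can also be sketched in base R without any merge, assuming the rows are already in observation order within each id. The names n_max and padded are made up for this illustration. For each individual, row positions past the end of the group are clamped to the last row, which is what carries the last observation forward.

```r
d <- data.frame(
  id    = c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3),
  value = c(10, 11, 12, 5, 9, 55, 14, 12, 20, 7)
)

n_max <- max(table(d$id))  # size of the largest group

padded <- do.call(rbind, lapply(split(d, d$id), function(g) {
  # positions 1..n_max, clamped to the last available row of this group
  idx <- pmin(seq_len(n_max), nrow(g))
  g[idx, ]
}))
rownames(padded) <- NULL

padded
```

This only repeats the final row of each group, so it does not fill gaps that occur in the middle of an individual's series.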

How to break ties in a ranking with gaps in ranking [duplicate]

Say that I have these data:
data <- data.frame(orig=c(1,5,5,5,14,18,18,25))
orig
1 1
2 5
3 5
4 5
5 14
6 18
7 18
8 25
I would like to create the want column:
orig want
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
This column takes orig and copies its value, but breaks ties if they exist. What I am trying to do is to re-create the rankings so that there are no ties and the ties are broken based on the order of the rows in the dataset. If not for the spaces in the rankings (jump from 1 to 5, etc.), I could use
library(tidyverse)
data %>% mutate(test = rank(orig, ties.method="min"))
But this of course doesn't get me what I want:
orig test
1 1 1
2 5 2
3 5 2
4 5 2
5 14 5
6 18 6
7 18 6
8 25 8
What can I do?
We may add row_number() after grouping
library(dplyr)
data %>%
group_by(orig) %>%
mutate(want = orig + row_number() - 1) %>%
ungroup()
Output:
# A tibble: 8 x 2
orig want
<dbl> <dbl>
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
Or we may simplify with rowid from data.table (dplyr is still needed for %>% and mutate)
library(dplyr)
library(data.table)
data %>%
mutate(want = orig + rowid(orig) - 1)
A base R option using ave + seq_along
transform(
data,
want = orig + ave(orig, orig, FUN = seq_along) - 1
)
gives
orig want
1 1 1
2 5 5
3 5 6
4 5 7
5 14 14
6 18 18
7 18 19
8 25 25
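All of the answers above add the within-group position minus one to the tied value. A small sketch of that idea as a helper (the name break_ties is hypothetical) makes it easy to check that the result is free of ties. It also shows a limitation worth keeping in mind: if a group of ties is larger than the gap to the next value, the shifted values can collide with it.

```r
# hypothetical helper: k-th occurrence of a value v becomes v + k - 1
break_ties <- function(x) x + ave(x, x, FUN = seq_along) - 1

orig <- c(1, 5, 5, 5, 14, 18, 18, 25)
want <- break_ties(orig)
want
anyDuplicated(want) == 0   # TRUE when no ties remain

# Limitation: two 5s followed by a 6 leave a new tie at 6
collide <- break_ties(c(5, 5, 6))
collide
```

For the data in the question the gaps are wide enough, so no collision occurs.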

Is there a way to remove duplicates based on two columns but keep the one with highest number in the third column? [duplicate]

I would like to take this dataset and remove rows that have the same ID and Age (duplicates), keeping only the one with the highest Month number.
ID|Age|Month|
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3
And have the outcome be
ID|Age|Month
1 25 12
2 18 11
3 12 10
3 25 10
4 19 10
5 10 3
Note that it removed the duplicates but kept the version with the highest month number.
One possible solution:
library(tidyverse)
df <- read.table(text = "ID Age Month
1 25 7
1 25 12
2 18 10
2 18 11
3 12 10
3 25 10
4 19 10
5 10 2
5 10 3", header = TRUE)
df %>%
group_by(ID, Age) %>%
slice_max(Month)
#> # A tibble: 6 x 3
#> # Groups: ID, Age [6]
#> ID Age Month
#> <int> <int> <int>
#> 1 1 25 12
#> 2 2 18 11
#> 3 3 12 10
#> 4 3 25 10
#> 5 4 19 10
#> 6 5 10 3
Created on 2021-02-11 by the reprex package (v1.0.0)
A solution using the dplyr package:
df %>%
group_by(ID, Age) %>%
filter(Month == max(Month))
# A tibble: 6 x 3
# Groups: ID, Age [6]
ID Age Month
<dbl> <dbl> <dbl>
1 1 25 12
2 2 18 11
3 3 12 10
4 3 25 10
5 4 19 10
6 5 10 3
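For comparison, here is a base R sketch of the same deduplication, assuming the column names from the question. The names ord and res are made up for this example. It sorts so that the highest Month comes first within each ID/Age pair and then drops every later duplicate.

```r
df <- data.frame(
  ID    = c(1, 1, 2, 2, 3, 3, 4, 5, 5),
  Age   = c(25, 25, 18, 18, 12, 25, 19, 10, 10),
  Month = c(7, 12, 10, 11, 10, 10, 10, 2, 3)
)

# Highest Month first within each (ID, Age) pair
ord <- df[order(df$ID, df$Age, -df$Month), ]

# duplicated() flags every occurrence after the first, so the first (highest) row survives
res <- ord[!duplicated(ord[c("ID", "Age")]), ]
res
```

One difference in behaviour: this keeps exactly one row per ID/Age pair, whereas filter(Month == max(Month)) and slice_max with its default with_ties = TRUE keep every row that shares the maximum.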

dplyr solution to split dataset, but keep IDs in same splits

I'm looking for a dplyr or tidyr solution to split a dataset into n chunks. However, I do not want to have any single ID go into multiple chunks. That is, each ID should appear in only one chunk.
For example, imagine "test" below is an ID variable, and the dataset has many other columns.
test<-data.frame(id= c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
val = 1:16)
out <- test %>% select(id) %>% ntile(n = 3)
out
[1] 1 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3
The ID=4 would end up in chunks 1 and 2. I am wondering how to code this so that all ID=4 end up in the same chunk (doesn't matter which one). I looked at the split function but could not find a way to do this.
The desired output would be something like
test[which(out==1),]
returning
id val
1 1 1
2 2 2
3 3 3
4 4 4
5 4 5
6 4 6
7 4 7
8 4 8
Then if I wanted to look at the second chunk, I would call something like test[which(out==2),], and so on up to out==n. I only want to deal with one chunk at a time. I don't need to create all n chunks simultaneously.
Create a data frame, then use mutate with ntile to add a chunk column. Note that applying ntile to id directly still splits id 4 across chunks 1 and 2:
library(dplyr)
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
mutate(new_column = ntile(id,3))
out
# A tibble: 16 x 3
id value new_column
<dbl> <int> <int>
1 1 1 1
2 2 2 1
3 3 3 1
4 4 4 1
5 4 5 1
6 4 6 1
7 4 7 2
8 4 8 2
9 6 9 2
10 7 10 2
11 8 11 2
12 9 12 3
13 9 13 3
14 9 14 3
15 9 15 3
16 10 16 3
Or given Frank's comment you could run the ntile function on distinct/unique values of the id - then join the original table back on id:
test <- tibble(id = c(1,2,3,4,4,4,4,4,6,7,8,9,9,9,9,10),
value = 1:16)
out <- test %>%
distinct(id) %>%
mutate(new_column = ntile(id,3)) %>%
right_join(test, by = "id")
out
# A tibble: 16 x 3
id new_column value
<dbl> <int> <int>
1 1 1 1
2 2 1 2
3 3 1 3
4 4 2 4
5 4 2 5
6 4 2 6
7 4 2 7
8 4 2 8
9 6 2 9
10 7 2 10
11 8 3 11
12 9 3 12
13 9 3 13
14 9 3 14
15 9 3 15
16 10 3 16
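The idea of assigning chunks to the distinct ids and mapping them back to the rows can also be sketched in base R. This is only an approximation of the answer above: the ceiling formula is an assumption that stands in for ntile and may distribute leftover ids differently when the number of ids is not a multiple of n. The names ids, id_chunk and chunk2 are made up for this illustration.

```r
test <- data.frame(id = c(1, 2, 3, 4, 4, 4, 4, 4, 6, 7, 8, 9, 9, 9, 9, 10),
                   val = 1:16)
n <- 3

ids <- unique(test$id)
# assign each distinct id to one of n chunks (stand-in for ntile, see note above)
id_chunk <- ceiling(seq_along(ids) * n / length(ids))

# map the per-id chunk back to every row of that id
test$chunk <- id_chunk[match(test$id, ids)]

# work with one chunk at a time
chunk2 <- test[test$chunk == 2, ]
chunk2
```

Because the chunks are balanced by the number of ids rather than the number of rows, chunk sizes can be uneven when some ids have many more rows than others.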

How to generate an uneven sequence of numbers in R

Here's an example data frame:
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
I want to generate a sequence of numbers according to the number of observations of y per x group (e.g. there are 2 observations of y for x=1). I want the sequence to be continuously increasing and to jump by 2 after each x group.
The desired output for this example would be:
1,2,5,6,7,10,11,14,17,20,21,22,25,26
How can I do this simply in R?
To expand on my comment, the groupings can be arbitrary; you simply need to recast them to the correct ordering. There are a few ways to do this: @akrun has shown that it can be done with the match function, or you can use the as.numeric function if that is easier to follow.
df <- data.frame(x=c(1,1,2,2,2,3,3,4,5,6,6,6,9,9),y=c(1,2,3,4,6,3,7,8,6,4,3,7,3,2))
# these are equivalent
df$newx <- as.numeric(factor(df$x, levels=unique(df$x)))
df$newx <- match(df$x, unique(df$x))
Since you now have a "new" releveling which is sequential, we can use the logic that was discussed in the comments.
df$newNumber <- 1:nrow(df) + (df$newx-1)*2
For this example, this will result in the following dataframe:
x y newx newNumber
1 1 1 1
1 2 1 2
2 3 2 5
2 4 2 6
2 6 2 7
3 3 3 10
3 7 3 11
4 8 4 14
5 6 5 17
6 4 6 20
6 3 6 21
6 7 6 22
9 3 7 25
9 2 7 26
where df$newNumber is the output you wanted.
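The same logic can be sketched with the gap as a parameter, which makes it easier to see why it works: every row gets its row number, plus the gap once for each group that came before it. The names gap, grp and new_number are made up for this example, and the sketch assumes that rows of the same x group are adjacent, as they are in the question.

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)

gap <- 2                        # numbers skipped between consecutive groups
grp <- match(x, unique(x))      # 1, 2, 3, ... in order of first appearance

new_number <- seq_along(x) + (grp - 1) * gap
new_number
```

With gap set to 2 this should reproduce the desired output from the question; other values of gap give wider or narrower jumps.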
To create the sequence 0,0,4,4,4,9,..., what you're doing is taking the minimum of each group and subtracting 1. The easiest way to do this is with dplyr.
library(dplyr)
df %>%
group_by(x) %>%
mutate(newNumber2 = min(newNumber) -1)
Which will have the output:
Source: local data frame [14 x 5]
Groups: x
x y newx newNumber newNumber2
1 1 1 1 1 0
2 1 2 1 2 0
3 2 3 2 5 4
4 2 4 2 6 4
5 2 6 2 7 4
6 3 3 3 10 9
7 3 7 3 11 9
8 4 8 4 14 13
9 5 6 5 17 16
10 6 4 6 20 19
11 6 3 6 21 19
12 6 7 6 22 19
13 9 3 7 25 24
14 9 2 7 26 24
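If dplyr is not available, the group minimum can be sketched in base R with ave, which returns the per-group result repeated for every row of the group. The vector names here are made up for this example.

```r
x <- c(1, 1, 2, 2, 2, 3, 3, 4, 5, 6, 6, 6, 9, 9)
new_number <- seq_along(x) + (match(x, unique(x)) - 1) * 2

# minimum of new_number within each x group, minus 1
new_number2 <- ave(new_number, x, FUN = min) - 1
new_number2
```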
