How to duplicate each row based on a new column? [duplicate] - r

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 2 years ago.
I'm not exactly sure how to ask the question since English isn't my first language. What I want is to duplicate the row for each unique id 13 times and create a new column whose values range from -8 to 4 across those 13 duplicated rows. I think my sample data and expected data will provide a better explanation.
sample data:
data <- data.frame(id = seq(1,100,1),
letters = sample(c("A", "B", "C", "D"), replace = TRUE))
> head(data)
id letters
1 1 A
2 2 B
3 3 B
4 4 C
5 5 A
6 6 B
the expected data:
newcol id letters
1 -8 1 A
2 -7 1 A
3 -6 1 A
4 -5 1 A
5 -4 1 A
6 -3 1 A
7 -2 1 A
8 -1 1 A
9 0 1 A
10 1 1 A
11 2 1 A
12 3 1 A
13 4 1 A
14 -8 2 B
15 -7 2 B
16 -6 2 B
17 -5 2 B
So I guess I could say that I want to create a new column with values ranging from -8 to 4 (so 13 different values) for each unique row in the id column.
Also, if possible, I would like to know how to do it in base R and with the data.table package.
Thank you and sorry for my poor grammar.

We can use uncount
library(tidyr)
library(dplyr)
data %>%
  uncount(13) %>%
  group_by(id) %>%
  mutate(newcol = -8:4) %>%
  ungroup()
Or in base R
data1 <- data[rep(seq_len(nrow(data)), each = 13),]
data1$newcol <- -8:4
Or using data.table
library(data.table)
setDT(data)[rep(seq_len(.N), each = 13)][, newcol := rep(-8:4, length.out = .N)][]
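As a quick sanity check of the base R approach, here is a minimal sketch on a three-row stand-in for the question's data. The names offsets and expanded are made up for this illustration, and the column reordering at the end is only there to match the layout of the expected output.

```r
# Small stand-in for the question's data (3 ids instead of 100)
data <- data.frame(id = 1:3, letters = c("A", "B", "B"))

# The values that every id should receive
offsets <- -8:4

# Repeat each row once per offset value
expanded <- data[rep(seq_len(nrow(data)), each = length(offsets)), ]

# Cycle the offsets through each block of repeated rows
expanded$newcol <- rep(offsets, times = nrow(data))

# Tidy up row names and put newcol first, as in the expected output
rownames(expanded) <- NULL
expanded <- expanded[, c("newcol", "id", "letters")]

head(expanded, 14)
```

Writing rep(offsets, times = nrow(data)) spells out the recycling that data1$newcol <- -8:4 relies on implicitly.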

Related

Calculate difference between current row and first row within group [duplicate]

This question already has answers here:
subtract first or second value from each row [duplicate]
(2 answers)
Closed 3 days ago.
I would like to create a new column in my dataset that shows the difference in the values (column b in example dataset) between the current row and the first row within a group (column a in example dataset) in R. How would I go about doing this?
a<-c(1,1,1,1,2,2,2,2)
b<-c(2,4,6,8,10,12,14,16)
have<-as.data.frame(cbind(a,b))
> have
a b
1 2
1 4
1 6
1 8
2 10
2 12
2 14
2 16
> want
a b c
1 2 0
1 4 2
1 6 4
1 8 6
2 10 0
2 12 2
2 14 4
2 16 6
You can use first() to address the first member in the group:
library(dplyr)
as.data.frame(cbind(a,b)) %>%
group_by(a) %>%
mutate(c = b - first(b)) %>%
ungroup()
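If you would rather avoid dplyr, the same idea can probably be written in base R with ave(). This is only a sketch on the question's example data; it assumes the rows are already in the order you want within each group.

```r
# Example data from the question
have <- data.frame(a = c(1, 1, 1, 1, 2, 2, 2, 2),
                   b = c(2, 4, 6, 8, 10, 12, 14, 16))

# ave() applies the function within each group of `a`;
# returning x[1] gives the group's first value, recycled across the group
first_b <- ave(have$b, have$a, FUN = function(x) x[1])

# Difference between each row and the first row of its group
have$c <- have$b - first_b

have
```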

gather() per grouped variables in R for specific columns

I have a long data frame with players' decisions who worked in groups.
I need to convert the data in such a way that each row (individual observation) would contain all group members decisions (so we basically can see whether they are interdependent).
Let's say the generating code is:
group_id <- c(rep(1, 3), rep(2, 3))
player_id <- c(rep(seq(1, 3), 2))
player_decision <- seq(10,60,10)
player_contribution <- seq(6,1,-1)
df <-
data.frame(group_id, player_id, player_decision, player_contribution)
So the initial data looks like:
group_id player_id player_decision player_contribution
1 1 1 10 6
2 1 2 20 5
3 1 3 30 4
4 2 1 40 3
5 2 2 50 2
6 2 3 60 1
But I need to convert it to wide format per group, and only for some of these variables (in this example, specifically for player_contribution), in such a way that the rest of the data remains. So the head of the converted data would be:
data.frame(group_id=c(1,1),
player_id=c(1,2),
player_decision=c(10,20),
player_1_contribution=c(6,6),
player_2_contribution=c(5,5),
player_3_contribution=c(4,4)
)
group_id player_id player_decision player_1_contribution player_2_contribution player_3_contribution
1 1 1 10 6 5 4
2 1 2 20 6 5 4
I suspect I need to group_by in dplyr and then somehow gather per group but only for player_contribution (or a vector of variables). But I really have no clue how to approach it. Any hints would be welcome!
Here is a solution using tidyr and dplyr.
First make a data frame with one column per player's contribution and one row per group. Then join it back onto the columns of interest from the original data frame.
library(tidyr)
library(dplyr)
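# One row per group, with one contribution column per player
wide <- pivot_wider(df, id_cols = group_id,
                    names_from = player_id,
                    values_from = player_contribution,
                    names_prefix = "player_contribution_")
# Join the per-group columns back onto the original per-player rows
answer <- left_join(df[, c("group_id", "player_id", "player_decision")], wide, by = "group_id")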
answer
group_id player_id player_decision player_contribution_1 player_contribution_2 player_contribution_3
1 1 1 10 6 5 4
2 1 2 20 6 5 4
3 1 3 30 6 5 4
4 2 1 40 3 2 1
5 2 2 50 3 2 1
6 2 3 60 3 2 1
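For comparison, a base R sketch of the same two steps (widen per group, then join back) might use reshape() and merge(). This is an alternative to the tidyr code, not part of the original answer, and it rebuilds the question's df so that it runs on its own.

```r
# Data from the question
df <- data.frame(
  group_id = c(rep(1, 3), rep(2, 3)),
  player_id = rep(1:3, 2),
  player_decision = seq(10, 60, 10),
  player_contribution = seq(6, 1, -1)
)

# Step 1: one row per group, one contribution column per player
wide <- reshape(df[, c("group_id", "player_id", "player_contribution")],
                idvar = "group_id", timevar = "player_id",
                direction = "wide", sep = "_")

# Step 2: join the per-group columns back onto the per-player rows
answer <- merge(df[, c("group_id", "player_id", "player_decision")],
                wide, by = "group_id")

answer
```

With sep = "_" the new columns should come out as player_contribution_1 to player_contribution_3, matching the names in the tidyr version.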

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To achieve this, I calculate a diff over the timestamps and want to start a new group whenever the diff reaches a certain value. I tried the code below. However, it does not work, because the dialog variable is not yet available inside the mutate call that creates it.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
mutate(t_diff = c(NA, diff(time))) %>%
# This generates an error as dialog is not available as a variable at this point
mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group. A new group starts wherever the diff to the previous element is at least 500.
Unfortunately, I have not found a clever workaround to achieve this in an efficient way using dplyr. Obviously, iterating over the data.frame with a loop would work, but would be very inefficient.
Is there a way to achieve this in dplyr?
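One possible approach, offered as a sketch rather than a definitive answer: instead of referring to the previous value of dialog, mark the rows where a new group starts and take a cumulative sum of those marks to get a group number. The helper column grp is a name made up for this illustration, and the threshold of 500 is taken from the code above.

```r
library(dplyr)

df <- data.frame(time = c(1, 2, 3, 4, 5, 510, 511, 512, 513),
                 id = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

res <- df %>%
  mutate(t_diff = c(NA, diff(time)),
         # TRUE on the first row and wherever the gap is at least 500;
         # cumsum() turns those start markers into a running group number
         grp = cumsum(is.na(t_diff) | t_diff >= 500)) %>%
  group_by(grp) %>%
  # every row in a group points to the id of the group's first row
  mutate(dialog = first(id)) %>%
  ungroup() %>%
  select(-grp)

res
```

This stays vectorised, so no explicit loop over the rows is needed.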

Subset specific row and last row from data frame

I have a data frame which contains data relating to a score of different events. There can be a number of scoring events for one game. What I would like to do, is to subset the occasions when the score goes above 5 or below -5. I would also like to get the last row for each ID. So for each ID, I would have one or more rows depending on whether the score goes above 5 or below -5. My actual data set contains many other columns of information, but if I learn how to do this then I'll be able to apply it to anything else that I may want to do.
Here is a data set
ID Score Time
1 0 0
1 3 5
1 -2 9
1 -4 17
1 -7 31
1 -1 43
2 0 0
2 -3 15
2 0 19
2 4 25
2 6 29
2 9 33
2 3 37
3 0 0
3 5 3
3 2 11
So for this data set, I would hopefully get this output:
ID Score Time
1 -7 31
1 -1 43
2 6 29
2 9 33
2 3 37
3 2 11
So at the very least, for each ID there will be one line printed with the last score for that ID, regardless of whether the score goes above 5 or below -5 during the event (this occurs for ID 3).
My attempt can subset when the value goes above 5 or below -5, I just don't know how to write code to get the last line for each ID:
Data[Data$Score > 5 | Data$Score < -5, ]
Let me know if you need any more information.
You can use rle to grab the last row for each ID. Check out ?rle for more information about this useful function.
Data2 <- Data[cumsum(rle(Data$ID)$lengths), ]
Data2
# ID Score Time
#6 1 -1 43
#13 2 3 37
#16 3 2 11
To combine the two conditions, use rbind.
Data2 <- rbind(Data[Data$Score > 5 | Data$Score < -5, ], Data[cumsum(rle(Data$ID)$lengths), ])
To get rid of rows that satisfy both conditions, you can use duplicated and rownames.
Data2 <- Data2[!duplicated(rownames(Data2)), ]
You can also sort if desired, of course.
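Putting these steps together, a self-contained sketch could collect row indices first and sort them, which removes duplicates and restores the original order in one go. The names last_rows, extreme_rows and keep are made up for this illustration, and the rle step assumes that all rows of an ID sit next to each other, as they do in the example data.

```r
# Example data from the question
Data <- data.frame(
  ID = c(rep(1, 6), rep(2, 7), rep(3, 3)),
  Score = c(0, 3, -2, -4, -7, -1, 0, -3, 0, 4, 6, 9, 3, 0, 5, 2),
  Time = c(0, 5, 9, 17, 31, 43, 0, 15, 19, 25, 29, 33, 37, 0, 3, 11)
)

# Index of the last row of each run of identical IDs
last_rows <- cumsum(rle(Data$ID)$lengths)

# Indices of rows whose score is above 5 or below -5
extreme_rows <- which(Data$Score > 5 | Data$Score < -5)

# Combine, drop indices that satisfy both conditions, and sort
keep <- sort(unique(c(extreme_rows, last_rows)))

Data[keep, ]
```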
Here's a go at it in data.table, where df is your original data frame.
library(data.table)
setDT(df)
df[df[, c(.I[!between(Score, -5, 5)], .I[.N]), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
We are grouping by ID. The between function finds the values between -5 and 5, and we negate that to get our desired values outside that range. We then use a .I subset to get the indices per group for those. Then .I[.N] gives us the row number of the last entry, per group. We use the V1 column of that result as our row subset for the entire table. You can take unique values if unique rows are desired.
Note: .I[c(which(!between(Score, -5, 5)), .N)] could also be used in the j entry of the first operation. Not sure if it's more or less efficient.
Addition: Another method, one that uses only logical values and will never produce duplicate rows in the output, is
df[df[, .I == .I[.N] | !between(Score, -5, 5), by = ID]$V1]
# ID Score Time
# 1: 1 -7 31
# 2: 1 -1 43
# 3: 2 6 29
# 4: 2 9 33
# 5: 2 3 37
# 6: 3 2 11
Here is another base R solution.
df[as.logical(ave(df$Score, df$ID,
FUN=function(i) abs(i) > 5 | seq_along(i) == length(i))), ]
ID Score Time
5 1 -7 31
6 1 -1 43
11 2 6 29
12 2 9 33
13 2 3 37
16 3 2 11
abs(i) > 5 | seq_along(i) == length(i) constructs a logical vector that returns TRUE for each element that fits your criteria. ave applies this function to each ID. The resulting logical vector is used to select the rows of the data.frame.
Here's a tidyverse solution. Not as concise as some of the above, but easier to follow.
library(tidyverse)
lastrows <- Data %>% group_by(ID) %>% top_n(1, Time)
scorerows <- Data %>% group_by(ID) %>% filter(!between(Score, -5, 5))
bind_rows(scorerows, lastrows) %>% arrange(ID, Time) %>% unique()
# A tibble: 6 x 3
# Groups: ID [3]
# ID Score Time
# <int> <int> <int>
# 1 1 -7 31
# 2 1 -1 43
# 3 2 6 29
# 4 2 9 33
# 5 2 3 37
# 6 3 2 11

Loop over each group and subtract their value [duplicate]

This question already has answers here:
R: Differences by group and adding
(3 answers)
Closed 6 years ago.
I have the following dataset:
df <- data.frame (id= c(1,1,1,2,2), time = c(13,14,17,17,17))
id time
1 1 13
2 1 14
3 1 17
4 2 17
5 2 17
and I wish to go over each id and subtract the previous time from the current time (with 0 for the first row of each id). So my ideal output would be:
#output
id time diff
1 1 13 0
2 1 14 1
3 1 17 3
4 2 17 0
5 2 17 0
What is the most efficient way for that?
Thanks to Zheyuan Li.
This is a great solution:
df$diff <- with(df, ave(time, id, FUN = function (x) c(0, diff(x))))
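For readers who prefer dplyr, a roughly equivalent sketch uses lag() within each group. This is an alternative to the ave() line above, not part of the original answer. Setting default = first(time) makes the first row of each id subtract itself and so come out as 0.

```r
library(dplyr)

df <- data.frame(id = c(1, 1, 1, 2, 2), time = c(13, 14, 17, 17, 17))

res <- df %>%
  group_by(id) %>%
  # lag() shifts time down by one row within each id;
  # the default fills the first row so its difference is 0
  mutate(diff = time - lag(time, default = first(time))) %>%
  ungroup()

res
```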
