Repeat/duplicate specific row of data frame and append - r

I would like to duplicate a certain row of a data frame based on information in that data frame, preferably with a tidyverse solution. I'd like to accomplish this without explicitly calling the original data frame in a function.
Here's a toy example.
df <- data.frame(var1 = c("A", "A", "A", "B", "B"),
                 var2 = c(1, 2, 3, 4, 5),
                 val = c(21, 31, 54, 65, 76))
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
All the solutions I've found so far require the user to input the desired row index. I'd like to find a way of doing it programmatically. In this case, I would like to duplicate the row where var1 is "A" that has the highest value of var2, and append it to the original data frame. The expected output is
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54

A variation using dplyr: find the max by group, filter for var1 == "A", and append.
library(dplyr)
df %>%
  group_by(var1) %>%
  filter(var2 == max(var2),
         var1 == "A") %>%
  bind_rows(df, .)
var1 var2 val
1 A 1 21
2 A 2 31
3 A 3 54
4 B 4 65
5 B 5 76
6 A 3 54

You could select the row that you want to duplicate and add it to the original dataframe:
library(dplyr)
var1_variable <- 'A'
df %>%
  filter(var1 == var1_variable) %>%
  slice_max(var2, n = 1) %>%
  # For dplyr < 1.0.0:
  # slice(which.max(var2)) %>%
  bind_rows(df, .)
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#6 A 3 54
In base R, that can be done as:
df1 <- subset(df, var1 == var1_variable)
rbind(df, df1[which.max(df1$var2), ])
Following this post, we can save the previous work in a temporary variable and then bind the rows, so that we don't break the chain and don't have to reference the original dataframe df by name.
df %>%
  # Previous list of commands
  {
    {. -> temp} %>%
      filter(var1 == var1_variable) %>%
      slice_max(var2, n = 1) %>%
      bind_rows(temp, .)
  }
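If this append step is needed more than once, one way to keep the chain readable is to wrap it in a small helper. The function name append_group_max below is my own (a hypothetical name, not from the answers above); the body just reuses the dplyr verbs shown earlier.
library(dplyr)

# Hypothetical helper: append the row with the largest var2
# within the requested var1 group to the data it receives.
append_group_max <- function(data, group_val) {
  data %>%
    bind_rows(
      data %>%
        filter(var1 == group_val) %>%
        slice_max(var2, n = 1)
    )
}

df %>% append_group_max("A")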

In base R, you can use rbind and subset to append the row(s) where var1 == "A" with the highest value of var2 to the original data frame.
rbind(x, subset(x[x$var1 == "A",], var2 == max(var2)))
# var1 var2 val
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 B 4 65
#5 B 5 76
#31 A 3 54
Data:
x <- data.frame(var1 = c("A", "A", "A", "B", "B"),
                var2 = c(1, 2, 3, 4, 5),
                val = c(21, 31, 54, 65, 76))

An option with uncount
library(dplyr)
library(tidyr)
df %>%
  uncount(replace(rep(1, n()), match(max(val[var1 == 'A']), val), 2)) %>%
  as_tibble()
# A tibble: 6 x 3
# var1 var2 val
# <chr> <dbl> <dbl>
#1 A 1 21
#2 A 2 31
#3 A 3 54
#4 A 3 54
#5 B 4 65
#6 B 5 76
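To see what uncount() is being fed here, it can help to build the weights vector by hand for the toy data. This is a minimal sketch of the same idea; I rebuild the weights with which() instead of match(), and the object names w and dup_at are my own.
library(dplyr)
library(tidyr)

# one copy of every row, two copies of the row to duplicate
w <- rep(1, nrow(df))
dup_at <- which(df$var1 == "A" & df$var2 == max(df$var2[df$var1 == "A"]))
w[dup_at] <- 2
w
# [1] 1 1 2 1 1

df %>% uncount(w)   # the duplicate appears next to the original row, as above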

Related

Identify duplicates and make column with common id [duplicate]

I have a df
df <- data.frame(ID = c('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h'),
                 var1 = c(1, 1, 3, 4, 5, 5, 7, 8),
                 var2 = c(1, 1, 0, 0, 1, 1, 0, 0),
                 var3 = c(50, 50, 30, 47, 33, 33, 70, 46))
Where columns var1-var3 are numerical inputs into modelling software. To save on computing time, I would like to simulate only the unique instances of var1-var3 in the modelling software, then join the results back to the main dataframe using left_join.
I need to add a second identifier to each row to show that it is the same as another row in terms of var1-var3. The output would be like:
ID var1 var2 var3 ID2
1 a 1 1 50 ab
2 b 1 1 50 ab
3 c 3 0 30 c
4 d 4 0 47 d
5 e 5 1 33 ef
6 f 5 1 33 ef
7 g 7 0 70 g
8 h 8 0 46 h
Then I can subset the unique rows of var1-var3 plus ID2, simulate them in the software, and join the results back to the main df using the new ID2.
With paste:
library(dplyr) #1.1.0
df %>%
  mutate(ID2 = paste(unique(ID), collapse = ""),
         .by = c(var1, var2))
# ID var1 var2 var3 ID2
# 1 a 1 1 50 ab
# 2 b 1 1 50 ab
# 3 c 3 0 30 c
# 4 d 4 0 47 d
# 5 e 5 1 33 ef
# 6 f 5 1 33 ef
# 7 g 7 0 70 g
# 8 h 8 0 46 h
Note that the .by argument is a new feature of dplyr 1.1.0. You can still use group_by and ungroup with earlier versions and/or if you have a more complex pipeline.
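For dplyr versions before 1.1.0, roughly the same result can be written with group_by()/ungroup(). To sketch the rest of the workflow the question describes, ID2 can then drive the deduplication and the later join back; run_model() and result are hypothetical placeholders, not real functions.
library(dplyr)

# pre-1.1.0 equivalent of the .by version above
df2 <- df %>%
  group_by(var1, var2) %>%
  mutate(ID2 = paste(unique(ID), collapse = "")) %>%
  ungroup()

# sketch of the question's workflow: model the unique rows, then join back
unique_rows <- df2 %>% distinct(ID2, var1, var2, var3)
# results  <- run_model(unique_rows)          # hypothetical: returns ID2 + result
# df_final <- left_join(df2, results, by = "ID2")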

Add a new column with the same value within each group based on value from specific row

I have a data frame with a grouping variable ID, a factor F and a value V that looks something like this:
df <- data.frame(ID = c(rep(1, 3), rep(2, 3)),
                 F = factor(c("A", "B", "X", "C", "D", "X")),
                 V = c(30, 32, 25, 31, 37, 24))
> df
ID F V
1 1 A 30
2 1 B 32
3 1 X 25
4 2 C 31
5 2 D 37
6 2 X 24
Now, I would like to add a new column New, which has the same value within each group (by ID) based on the value for V in the row where F==X using the tidyverse environment. Ideally, those rows would be removed afterwards so that the new data frame looks like this:
> df
ID F V New
1 1 A 30 25
2 1 B 32 25
3 2 C 31 24
4 2 D 37 24
I know that I have to use the group_by() function and probably also mutate(), but I couldn't quite manage to get my desired result.
df %>%
  group_by(ID) %>%
  mutate(New = V[F == 'X']) %>%  # take V from the row where F == "X" within each ID
  filter(F != 'X')               # then drop the F == "X" rows
# A tibble: 4 × 4
# Groups: ID [2]
ID F V New
<dbl> <fct> <dbl> <dbl>
1 1 A 30 25
2 1 B 32 25
3 2 C 31 24
4 2 D 37 24
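As an aside, if some ID could lack an F == "X" row, V[F == 'X'] would have length zero and the mutate() above would error. A hedged variant that guards against that case:
library(dplyr)

df %>%
  group_by(ID) %>%
  mutate(New = if (any(F == "X")) V[F == "X"][1] else NA_real_) %>%
  ungroup() %>%
  filter(F != "X")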
library(dplyr)
df %>%
  group_by(ID) %>%                                     # group rows by ID
  mutate(New = ifelse(F == "X", V, NA)) %>%            # New holds V only on the F == "X" row
  summarise(New = max(New, na.rm = TRUE)) %>%          # collapse to the one non-NA New value per ID
  right_join(df %>% filter(F != "X"), by = "ID") %>%   # SQL-like join onto the non-"X" rows
  select(ID, F, V, New)                                # reorder the columns to the desired order
And you get this output:
# A tibble: 4 × 4
ID F V New
<dbl> <fct> <dbl> <dbl>
1 1 A 30 25
2 1 B 32 25
3 2 C 31 24
4 2 D 37 24
Or even simpler:
df %>%
  filter(F == "X") %>%                                 # keep the rows with "X" in the F column
  right_join(df %>% filter(F != "X"), by = "ID") %>%   # join to the same data without the "X" rows
  select(ID, F = F.y, V = V.y, New = V.x)              # reorder and rename the columns
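The .x/.y suffixes appear because both inputs share the F and V columns. If you prefer self-documenting names, right_join() accepts a suffix argument; a hedged variant of the same join, with suffixes of my own choosing:
library(dplyr)

df %>%
  filter(F == "X") %>%
  right_join(df %>% filter(F != "X"), by = "ID", suffix = c("_x", "_keep")) %>%
  select(ID, F = F_keep, V = V_keep, New = V_x)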

Filtering by conditional values in R

So, my data take the general shape of:
library(tidyverse)
id <- c(1, 1, 2, 2, 3, 3)
group <- c("A", "B", "A", "A", "B", "B")
value <- c(34, 12, 56, 78, 90, 91)
df <- tibble(id, group, value)
df
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 1 B 12
3 2 A 56
4 2 A 78
5 3 B 90
6 3 B 91
What I want to do can be described as "for each id, take the maximum value of group A. But, if A is not there, take the maximum value of group B." So my desired output would look something like:
id group value
<dbl> <chr> <dbl>
1 1 A 34
4 2 A 78
6 3 B 91
I tried to do this using the code...
desired <- df %>%
  group_by(id) %>%
  filter(if (exists(group == "A")) max(value) else if (exists(group == "B")) (max(value)))
...but I received an error. Help?
One option could be:
df %>%
  group_by(id) %>%
  arrange(group, desc(value), .by_group = TRUE) %>%
  slice(which.max(group == "A"))
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 2 A 78
3 3 B 91
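Here is a sketch that follows the asker's own "take A, but if A is not there, take B" wording more literally. The if/else inside filter() is my own phrasing; it is evaluated once per id group.
library(dplyr)

df %>%
  group_by(id) %>%
  filter(if (any(group == "A")) group == "A" else group == "B") %>%
  slice_max(value, n = 1) %>%
  ungroup()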
Here is a base R option
subset(
  df[order(id, group, -value), ],
  ave(rep(TRUE, nrow(df)), id, FUN = function(x) seq_along(x) == 1)
)
which gives
id group value
<dbl> <chr> <dbl>
1 1 A 34
2 2 A 78
3 3 B 91
The basic idea is:
We reorder the rows of df via df[order(id, group, -value), ]
Then we take the first row per id in the reordered df
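"Take the first row per id" can also be expressed with duplicated(), which some readers find easier to follow than ave(); a small base R sketch of the same idea (df_sorted is my own name):
# sort so the preferred row comes first within each id, then keep that row
df_sorted <- df[order(df$id, df$group, -df$value), ]
df_sorted[!duplicated(df_sorted$id), ]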
Using data.table
library(data.table)
setDT(df)[order(id, group, -value), .SD[1], id]
# id group value
#1: 1 A 34
#2: 2 A 78
#3: 3 B 91

Filling in non-existing rows in R + dplyr [duplicate]

Apologies if this is a duplicate question; I saw some questions similar to mine, but none that exactly addressed my problem.
My data look basically like this:
FiscalWeek <- as.factor(c(45, 46, 48, 48, 48))
Group <- c("A", "A", "A", "B", "C")
Amount <- c(1, 1, 1, 5, 6)
df <- tibble(FiscalWeek, Group, Amount)
df
# A tibble: 5 x 3
FiscalWeek Group Amount
<fct> <chr> <dbl>
1 45 A 1
2 46 A 1
3 48 A 1
4 48 B 5
5 48 C 6
Note that FiscalWeek is a factor. So, when I take a weekly average by Group, I get this:
library(dplyr)
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
averages
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 1
2 B 5
3 C 6
But, this is actually a four-week period. Nothing at all happened in Week 47, and groups B and C didn't show data in weeks 45 and 46, but I still want averages that reflect the existence of those weeks. So I need to fill out my original data with zeroes such that this is my desired result:
DesiredGroup <- c("A", "B", "C")
DesiredAvgs <- c(0.75, 1.25, 1.5)
Desired <- tibble(DesiredGroup, DesiredAvgs)
Desired
# A tibble: 3 x 2
DesiredGroup DesiredAvgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
What is the best way to do this using dplyr?
Up front: missing data to me is very different from 0. I'm assuming that you "know" with certainty that missing data should bring all other values down.
The name FiscalWeek suggests integer-like data, but your use of a factor suggests it is ordinal or categorical. Because of that, you need to define authoritatively what the complete set of levels can be. And because your current factor does not contain all possible levels, I'll infer them (you may need to adjust your all_groups_weeks accordingly):
all_groups_weeks <- tidyr::expand_grid(FiscalWeek = as.factor(45:48), Group = c("A", "B", "C"))
all_groups_weeks
# # A tibble: 12 x 2
# FiscalWeek Group
# <fct> <chr>
# 1 45 A
# 2 45 B
# 3 45 C
# 4 46 A
# 5 46 B
# 6 46 C
# 7 47 A
# 8 47 B
# 9 47 C
# 10 48 A
# 11 48 B
# 12 48 C
From here, join in the full data in order to "complete" it. Using tidyr::complete on the data alone won't work, because not all possible values are present in the data (week 47 is missing entirely).
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0))
# # A tibble: 12 x 3
# FiscalWeek Group Amount
# <fct> <chr> <dbl>
# 1 45 A 1
# 2 46 A 1
# 3 48 A 1
# 4 48 B 5
# 5 48 C 6
# 6 45 B 0
# 7 45 C 0
# 8 46 B 0
# 9 46 C 0
# 10 47 A 0
# 11 47 B 0
# 12 47 C 0
full_join(df, all_groups_weeks, by = c("FiscalWeek", "Group")) %>%
  mutate(Amount = coalesce(Amount, 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount, na.rm = TRUE))
# # A tibble: 3 x 2
# Group Avgs
# <chr> <dbl>
# 1 A 0.75
# 2 B 1.25
# 3 C 1.5
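As an aside, tidyr::complete() can still get you there if the full week range is supplied explicitly instead of being taken from the data. A minimal sketch, assuming weeks 45:48 are the known fiscal period; FiscalWeek is converted to integer first to sidestep the factor-level mismatch.
library(dplyr)
library(tidyr)

df %>%
  mutate(FiscalWeek = as.integer(as.character(FiscalWeek))) %>%
  complete(FiscalWeek = 45:48, Group, fill = list(Amount = 0)) %>%
  group_by(Group) %>%
  summarize(Avgs = mean(Amount))
# A: 0.75, B: 1.25, C: 1.5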
You can try this. I hope this helps.
library(dplyr)
# Define the range of weeks
df <- df %>% mutate(FiscalWeek = as.numeric(as.character(FiscalWeek)))
range <- length(seq(min(df$FiscalWeek), max(df$FiscalWeek), by = 1))

# Aggregation
averages <- df %>%
  group_by(Group) %>%
  summarize(Avgs = sum(Amount) / range)
# A tibble: 3 x 2
Group Avgs
<chr> <dbl>
1 A 0.75
2 B 1.25
3 C 1.5
You can do it without filling if you know number of weeks:
df %>%
  group_by(Group) %>%
  summarise(Avgs = sum(Amount) / length(45:48))
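If the number of weeks should come from the data rather than being hard-coded, a small sketch (assuming the fiscal period runs from the smallest to the largest observed week; n_weeks is my own name):
library(dplyr)

n_weeks <- diff(range(as.integer(as.character(df$FiscalWeek)))) + 1   # 45..48 -> 4 weeks

df %>%
  group_by(Group) %>%
  summarise(Avgs = sum(Amount) / n_weeks)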

R insert week number from vector and perform na.locf afterwards

For a dataframe similar to the one below (but much larger, obviously) I want to add missing week numbers from a vector (named weeks below). In the end, each value of var1 should have 3 rows, one for each of weeks 40-42, so the weeks that need to be inserted can differ between values of var1. Initially the inserted rows can have value NA, but as a second step I would like to perform na.locf for each value of var1. Does anyone know how to do this?
Data frame example:
dat <- data.frame(var1 = rep(c('a', 'b', 'c', 'd'), 3),
                  week = c(rep(40, 4), rep(41, 4), rep(42, 4)),
                  value = c(2, 3, 3, 2, 4, 5, 5, 6, 8, 9, 10, 10))
dat <- dat[-c(6, 11), ]
weeks <- c(40:42)
Like this?
dat %>%
  tidyr::complete(var1, week) %>%
  group_by(var1) %>%
  arrange(week, .by_group = TRUE) %>%
  tidyr::fill(value)
# A tibble: 12 x 3
# Groups: var1 [4]
var1 week value
<fct> <dbl> <dbl>
1 a 40 2
2 a 41 4
3 a 42 8
4 b 40 3
5 b 41 3
6 b 42 9
7 c 40 3
8 c 41 5
9 c 42 5
10 d 40 2
11 d 41 6
12 d 42 10
Have you considered tidyr::complete() and tidyr::fill()?
library(dplyr)
library(tidyr)
complete(dat, week = 40:42, var1 = c("a", "b", "c", "d")) %>%
  group_by(var1) %>%
  fill(value, .direction = "down") %>%
  ungroup()
