Duplicate rows in dataframe - r

I have a data.frame which looks like so:
df <- data.frame(id=c("001","002","003","004"),year=c(2015,2015,2015,2015),
x1=c(15,20,25,30),x2=c(1,2,3,4))
id year x1 x2
001 2015 15 1
002 2015 20 2
003 2015 25 3
004 2015 30 4
I would like to duplicate id, x1, and x2 but change the year to end up with a data.frame that resembles the following:
id year x1 x2
001 2015 15 1
002 2015 20 2
003 2015 25 3
004 2015 30 4
001 2016 15 1
002 2016 20 2
003 2016 25 3
004 2016 30 4
I can achieve this by doing
library(dplyr)
df2 <- df %>%
mutate(year = 2016)
df3 <- rbind(df, df2)
But I am wondering if there is a more intuitive way, so that I can create duplicates for 20+ years without needing to make multiple new data.frames?

df <- data.frame(id=c("001","002","003","004"),year=c(2015,2015,2015,2015),
x1=c(15,20,25,30),x2=c(1,2,3,4))
library(tidyr)
df %>% complete(nesting(id, x1, x2), year = 2015:2016)
#> # A tibble: 8 x 4
#> id x1 x2 year
#> <chr> <dbl> <dbl> <dbl>
#> 1 001 15 1 2015
#> 2 001 15 1 2016
#> 3 002 20 2 2015
#> 4 002 20 2 2016
#> 5 003 25 3 2015
#> 6 003 25 3 2016
#> 7 004 30 4 2015
#> 8 004 30 4 2016
To cover more years, change 2015:2016 to the range you need. You can also build the range dynamically with seq().
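As a minimal sketch of the seq idea mentioned above, the years can be generated from the data instead of typed by hand. The name n_years is an assumption introduced here for illustration; it is not part of the original answer.

```r
library(tidyr)

df <- data.frame(id = c("001", "002", "003", "004"),
                 year = c(2015, 2015, 2015, 2015),
                 x1 = c(15, 20, 25, 30), x2 = c(1, 2, 3, 4))

# Hypothetical setting: how many consecutive years each id should cover
n_years <- 20

# Start at the earliest year present in the data and step by one year
years <- seq(min(df$year), by = 1, length.out = n_years)

# One row per (id, x1, x2) combination and year
out <- complete(df, nesting(id, x1, x2), year = years)
```

Changing n_years is then the only edit needed to cover a longer or shorter period.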

library(tidyverse)
df <- data.frame(id=c("001","002","003","004"),year=c(2015,2015,2015,2015),
x1=c(15,20,25,30),x2=c(1,2,3,4))
map_dfr(0:1, ~mutate(df, year = year + .x))
#> id year x1 x2
#> 1 001 2015 15 1
#> 2 002 2015 20 2
#> 3 003 2015 25 3
#> 4 004 2015 30 4
#> 5 001 2016 15 1
#> 6 002 2016 20 2
#> 7 003 2016 25 3
#> 8 004 2016 30 4
Created on 2021-06-16 by the reprex package (v2.0.0)
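If loading the tidyverse is not desirable, roughly the same idea can be sketched in base R with lapply and rbind. The offsets 0:19 (20 years) and the helper name shift_year are assumptions chosen for illustration.

```r
df <- data.frame(id = c("001", "002", "003", "004"),
                 year = c(2015, 2015, 2015, 2015),
                 x1 = c(15, 20, 25, 30), x2 = c(1, 2, 3, 4))

# Year offsets: 0 keeps 2015, 19 gives 2034 (20 years in total; adjust as needed)
offsets <- 0:19

# Hypothetical helper: return a copy of df with the year shifted by k
shift_year <- function(k) {
  d <- df
  d$year <- d$year + k
  d
}

# Build one shifted copy per offset, then stack the copies
out <- do.call(rbind, lapply(offsets, shift_year))
rownames(out) <- NULL
```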

Related

Repeating annual values multiple times to form a monthly dataframe

I have an annual dataset as below:
year <- c(2016,2017,2018)
xxx <- c(1,2,3)
yyy <- c(4,5,6)
df <- data.frame(year,xxx,yyy)
print(df)
year xxx yyy
1 2016 1 4
2 2017 2 5
3 2018 3 6
Where the values in column xxx and yyy correspond to values for that year.
I would like to expand this dataframe (or create a new dataframe), which retains the same column names, but repeats each value 12 times (corresponding to the month of that year) and repeat the yearly value 12 times in the first column.
As mocked up by the code below:
year <- rep(2016:2018,each=12)
xxx <- rep(1:3,each=12)
yyy <- rep(4:6,each=12)
df2 <- data.frame(year,xxx,yyy)
print(df2)
year xxx yyy
1 2016 1 4
2 2016 1 4
3 2016 1 4
4 2016 1 4
5 2016 1 4
6 2016 1 4
7 2016 1 4
8 2016 1 4
9 2016 1 4
10 2016 1 4
11 2016 1 4
12 2016 1 4
13 2017 2 5
14 2017 2 5
15 2017 2 5
16 2017 2 5
17 2017 2 5
18 2017 2 5
19 2017 2 5
20 2017 2 5
21 2017 2 5
22 2017 2 5
23 2017 2 5
24 2017 2 5
25 2018 3 6
26 2018 3 6
27 2018 3 6
28 2018 3 6
29 2018 3 6
30 2018 3 6
31 2018 3 6
32 2018 3 6
33 2018 3 6
34 2018 3 6
35 2018 3 6
36 2018 3 6
Any help would be greatly appreciated!
I'm new to R and I can see how I would do this with a loop statement but was wondering if there was an easier solution.
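Convert df to a matrix, take the Kronecker product with a vector of 12 ones, and then convert back to a data.frame. The Kronecker product drops the column names, so restore them with setNames. If a matrix result is OK, the as.data.frame and setNames calls can be omitted.
setNames(as.data.frame(as.matrix(df) %x% rep(1, 12)), names(df))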
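Another way to get the same expansion, sketched here in base R, is to repeat the row indices and subset the data frame with them. This keeps the column names and column types without a matrix round trip. The name idx is introduced only for illustration.

```r
df <- data.frame(year = c(2016, 2017, 2018),
                 xxx = c(1, 2, 3),
                 yyy = c(4, 5, 6))

# Repeat each row index 12 times: 1,1,...,1,2,2,...,2,3,3,...,3
idx <- rep(seq_len(nrow(df)), each = 12)

# Subsetting with repeated indices duplicates the rows
df2 <- df[idx, ]

# Reset the row names, which otherwise become 1, 1.1, 1.2, ...
rownames(df2) <- NULL
```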

Moving average by multiple group

I have the following DF (demo). I would like to find the previous 3-month moving average of the AMOUNT column per ID, Year and Month.
ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500
Following is an attempt.
library(TTR)
library(dplyr)
DF %>% group_by(ID, YEAR, MONTH) %>% mutate(3MA = runMean(AMOUNT, 3))
This results in an error saying that n=3 is outside valid range.
Desired Output:
ID YEAR MONTH AMOUNT 3MA
1 ABC 2020 09 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 01 400 200 (100+200+300)/3
5 ABC 2021 04 500 300 (400+300+200)/3
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 01 400 200 (100+200+300)/3
10 PQR 2021 03 500 300 (400+300+200)/3
You can use the following code:
library(dplyr)
arrange(DF, ID, YEAR, MONTH) %>%
group_by(ID) %>%
mutate(lag1=lag(AMOUNT),
lag2=lag(AMOUNT,2),
lag3=lag(AMOUNT,3),
movave=(lag1+lag2+lag3)/3)
#> # A tibble: 10 × 8
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT lag1 lag2 lag3 movave
#> <chr> <int> <int> <int> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA NA NA NA
#> 2 ABC 2020 11 200 100 NA NA NA
#> 3 ABC 2020 12 300 200 100 NA NA
#> 4 ABC 2021 1 400 300 200 100 200
#> 5 ABC 2021 4 500 400 300 200 300
#> 6 PQR 2020 10 100 NA NA NA NA
#> 7 PQR 2020 11 200 100 NA NA NA
#> 8 PQR 2020 12 300 200 100 NA NA
#> 9 PQR 2021 1 400 300 200 100 200
#> 10 PQR 2021 3 500 400 300 200 300
Created on 2022-07-02 by the reprex package (v2.0.1)
An option using a sliding window:
library(tidyverse)
library(slider)
df <- tribble(
~id, ~year, ~month, ~amount,
"ABC", 2020, 09, 100,
"ABC", 2020, 11, 200,
"ABC", 2020, 12, 300,
"ABC", 2021, 01, 400,
"ABC", 2021, 04, 500,
"PQR", 2020, 10, 100,
"PQR", 2020, 11, 200,
"PQR", 2020, 12, 300,
"PQR", 2021, 01, 400,
"PQR", 2021, 03, 500
)
df |>
arrange(id, year, month) |>
group_by(id) |>
mutate(ma3 = slide_dbl(lag(amount), mean, .before = 2, .complete = TRUE)) |>
ungroup() # if needed
#> # A tibble: 10 × 5
#> id year month amount ma3
#> <chr> <dbl> <dbl> <dbl> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Here is a way.
suppressPackageStartupMessages({
library(dplyr)
library(TTR)
})
x <- ' ID YEAR MONTH AMOUNT
1 ABC 2020 09 100
2 ABC 2020 11 200
3 ABC 2020 12 300
4 ABC 2021 01 400
5 ABC 2021 04 500
6 PQR 2020 10 100
7 PQR 2020 11 200
8 PQR 2020 12 300
9 PQR 2021 01 400
10 PQR 2021 03 500 '
DF <- read.table(textConnection(x), header = TRUE)
DF %>%
arrange(ID, YEAR, MONTH) %>%
group_by(ID) %>%
mutate(`3MA` = lag(runMean(AMOUNT, 3)))
#> # A tibble: 10 × 5
#> # Groups: ID [2]
#> ID YEAR MONTH AMOUNT `3MA`
#> <chr> <int> <int> <int> <dbl>
#> 1 ABC 2020 9 100 NA
#> 2 ABC 2020 11 200 NA
#> 3 ABC 2020 12 300 NA
#> 4 ABC 2021 1 400 200
#> 5 ABC 2021 4 500 300
#> 6 PQR 2020 10 100 NA
#> 7 PQR 2020 11 200 NA
#> 8 PQR 2020 12 300 NA
#> 9 PQR 2021 1 400 200
#> 10 PQR 2021 3 500 300
Created on 2022-07-02 by the reprex package (v2.0.1)
Try this
DF |> arrange(ID , YEAR , MONTH) |> group_by(ID) |>
mutate(`3M` = (lag(AMOUNT) + lag(AMOUNT ,2) + lag(AMOUNT , 3)) / 3)
output
# A tibble: 10 × 5
# Groups: ID [2]
ID YEAR MONTH AMOUNT `3M`
<chr> <int> <int> <int> <dbl>
1 ABC 2020 9 100 NA
2 ABC 2020 11 200 NA
3 ABC 2020 12 300 NA
4 ABC 2021 1 400 200
5 ABC 2021 4 500 300
6 PQR 2020 10 100 NA
7 PQR 2020 11 200 NA
8 PQR 2020 12 300 NA
9 PQR 2021 1 400 200
10 PQR 2021 3 500 300
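All of the answers above group by ID only. The original attempt grouped by ID, YEAR and MONTH, which leaves one row per group in this data, and that is the likely reason a window of 3 was rejected. For comparison, here is a minimal base R sketch of the same calculation. The helper name prev3_mean is hypothetical, and the sketch assumes every ID has at least 3 rows.

```r
DF <- data.frame(
  ID = rep(c("ABC", "PQR"), each = 5),
  YEAR = c(2020, 2020, 2020, 2021, 2021, 2020, 2020, 2020, 2021, 2021),
  MONTH = c(9, 11, 12, 1, 4, 10, 11, 12, 1, 3),
  AMOUNT = rep(c(100, 200, 300, 400, 500), 2)
)

# Sort so that "previous rows" means "earlier months" within each ID
DF <- DF[order(DF$ID, DF$YEAR, DF$MONTH), ]

# Hypothetical helper: mean of the 3 rows before the current one
prev3_mean <- function(x) {
  # Trailing mean over the current row and the 2 rows before it
  ma <- as.numeric(stats::filter(x, rep(1 / 3, 3), sides = 1))
  # Shift down by one row so the current row is excluded
  c(NA, ma[-length(ma)])
}

# Apply the helper separately within each ID
DF$MA3 <- ave(DF$AMOUNT, DF$ID, FUN = prev3_mean)
```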

How to add a new column with values specific to grouped variables

I'm new to R and have found similar solutions to my problem, but I'm struggling to apply these to my code. Please help...
These data are simplified, as the id variables are many:
df = data.frame(id = rep(c("a_10", "a_11", "b_10", "b_11"), each = 5),
site = rep(1:5, 4),
value = sample(1:20))
The aim is to add another column labelled "year" with values that are grouped by "id" but the true names are many - so I'm trying to simplify the code by using the ending digits.
I can use dplyr to split the dataframe into each id variable using this code (repeated for each id variable):
df %>%
select(site, id, value) %>%
filter(grepl("10$", id)) %>%
mutate(Year = "2010")
Rather than using merge to re-combine the dataframes back into one, is there a simpler method?
I tried modifying case_when with mutate as described in a previous answer:
https://stackoverflow.com/a/63043920/12313457
mutate(year = case_when(grepl(c("10$", "11$", id) == c("2010", "2011"))))
Is something like this possible?
Thanks in advance
In case your id column has different string lengths you can use sub:
df %>%
mutate(Year = paste0("20", sub('^.*_(\\d+)$', '\\1', id)))
#> id site value Year
#> 1 a_10 1 2 2010
#> 2 a_10 2 7 2010
#> 3 a_10 3 16 2010
#> 4 a_10 4 10 2010
#> 5 a_10 5 11 2010
#> 6 a_11 1 5 2011
#> 7 a_11 2 13 2011
#> 8 a_11 3 14 2011
#> 9 a_11 4 6 2011
#> 10 a_11 5 12 2011
#> 11 b_10 1 17 2010
#> 12 b_10 2 1 2010
#> 13 b_10 3 4 2010
#> 14 b_10 4 15 2010
#> 15 b_10 5 9 2010
#> 16 b_11 1 8 2011
#> 17 b_11 2 20 2011
#> 18 b_11 3 19 2011
#> 19 b_11 4 18 2011
#> 20 b_11 5 3 2011
Created on 2022-04-21 by the reprex package (v2.0.1)
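You can use substr to get the final two digits of id and then paste0 them to "20" to recreate the year. This assumes every id has the same length as in the example, since the digits are taken from fixed positions 3 and 4.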
df |> dplyr::mutate(Year = paste0("20", substr(id, 3, 4)))
#> id site value Year
#> 1 a_10 1 5 2010
#> 2 a_10 2 12 2010
#> 3 a_10 3 9 2010
#> 4 a_10 4 7 2010
#> 5 a_10 5 13 2010
#> 6 a_11 1 3 2011
#> 7 a_11 2 4 2011
#> 8 a_11 3 16 2011
#> 9 a_11 4 2 2011
#> 10 a_11 5 6 2011
#> 11 b_10 1 19 2010
#> 12 b_10 2 14 2010
#> 13 b_10 3 15 2010
#> 14 b_10 4 10 2010
#> 15 b_10 5 11 2010
#> 16 b_11 1 18 2011
#> 17 b_11 2 1 2011
#> 18 b_11 3 20 2011
#> 19 b_11 4 17 2011
#> 20 b_11 5 8 2011
Created on 2022-04-21 by the reprex package (v2.0.1)
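The case_when approach attempted in the question can also be made to work. In this minimal sketch each condition is written as its own formula, with the grepl test on the left and the label on the right. The value column uses 1:20 instead of sample(1:20) only to keep the example reproducible.

```r
library(dplyr)

df <- data.frame(id = rep(c("a_10", "a_11", "b_10", "b_11"), each = 5),
                 site = rep(1:5, 4),
                 value = 1:20)

out <- df %>%
  mutate(Year = case_when(
    grepl("10$", id) ~ "2010",  # ids ending in 10
    grepl("11$", id) ~ "2011"   # ids ending in 11
  ))
# ids matching neither pattern would get NA
```

With many distinct endings the sub or substr answers scale better, because case_when needs one line per ending.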

R: Count unique rows for each unique name per year

This is my first question on stackoverflow. I searched for similar questions but I didn't find an answer.
I know that the question in the title isn't clear but I hope you are going to understand what I want as output.
I have a dataframe that looks like this:
ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019
If names in the "Name" column share the same ID, they are collaborators on a project.
I want to add a column with the number of UNIQUE collaborators per year for each name (each name is included in the count of its own collaborators).
The output should look like this (I added the last column only to explain the count; it is not needed):
ID Name Year Unique_Coll explication
1 1 Anas 2018 3 (Anas, Carl, Catherine)
2 1 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
3 1 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
4 2 Anas 2018 3 (Anas, Carl, Catherine)
5 2 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
6 3 Catherine 2018 4 (Catherine, Carl, Anas, Julien)
7 3 Julien 2018 2 (Julien, Catherine)
8 4 Raul 2018 4 (Raul, Ahmed, Laurence, Carl)
9 4 Ahmed 2018 4 (Ahmed, Raul, Laurence, Carl)
10 4 Laurence 2018 4 (Laurence, Raul, Ahmed, Carl)
11 4 Carl 2018 6 (Carl, Anas, Catherine, Laurence, Ahmed, Raul)
12 5 Anas 2019 4 (Anas, Georges, Arman, Pietro)
13 5 Georges 2019 3 (Georges, Anas, Arman)
14 5 Arman 2019 3 (Arman, Anas, Georges)
15 6 Anas 2019 4 (Anas, Georges, Arman, Pietro)
16 6 Pietro 2019 2 (Pietro, Anas)
17 7 Pietro 2019 2 (Pietro, Anas)
18 8 Diego 2019 1 (Diego)
Thank you
You could construct a variable that would be a list of names and count the number of unique names in the following way:
library(dplyr)
df = df %>%
group_by(ID) %>%
mutate(group = list(Name)) %>%
group_by(Year,Name) %>%
mutate(n = n_distinct(unlist(group))) %>%
select(-group)
# A tibble: 18 x 4
# Groups: Year, Name [12]
ID Name Year n
<int> <chr> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
The following solution uses dplyr to first join all collaborators to every Name, creating a column Name_collab (note that this expands the data frame and could blow it up if it were large). Then, we count the distinct Name_collab for every Name, Year combination and get rid of duplicates.
library(dplyr)
df %>%
left_join(df, by = c("ID", "Year"), suffix = c("", "_collab")) %>%
group_by(Name, Year) %>%
mutate(Unique_Coll = n_distinct(Name_collab)) %>%
ungroup() %>%
distinct(ID, Name, Year, Unique_Coll)
which gives
# A tibble: 18 x 4
ID Name Year Unique_Coll
<int> <fct> <int> <int>
1 1 Anas 2018 3
2 1 Carl 2018 6
3 1 Catherine 2018 4
4 2 Anas 2018 3
5 2 Carl 2018 6
6 3 Catherine 2018 4
7 3 Julien 2018 2
8 4 Raul 2018 4
9 4 Ahmed 2018 4
10 4 Laurence 2018 4
11 4 Carl 2018 6
12 5 Anas 2019 4
13 5 Georges 2019 3
14 5 Arman 2019 3
15 6 Anas 2019 4
16 6 Pietro 2019 2
17 7 Pietro 2019 2
18 8 Diego 2019 1
Input:
df <- read.table(text="ID Name Year
1 1 Anas 2018
2 1 Carl 2018
3 1 Catherine 2018
4 2 Anas 2018
5 2 Carl 2018
6 3 Catherine 2018
7 3 Julien 2018
8 4 Raul 2018
9 4 Ahmed 2018
10 4 Laurence 2018
11 4 Carl 2018
12 5 Anas 2019
13 5 Georges 2019
14 5 Arman 2019
15 6 Anas 2019
16 6 Pietro 2019
17 7 Pietro 2019
18 8 Diego 2019")
I have a solution using joins.
library(tidyverse)
# read data
dta <- tribble(~ID, ~Name, ~Year,
1, "Anas", 2018,
1, "Carl", 2018,
1, "Catherine", 2018,
2, "Anas", 2018,
2, "Carl", 2018,
3, "Catherine", 2018,
3, "Julien", 2018,
4, "Raul", 2018,
4, "Ahmed", 2018,
4, "Laurence", 2018,
4, "Carl", 2018,
5, "Anas", 2019,
5, "Georges", 2019,
5, "Arman", 2019,
6, "Anas", 2019,
6, "Pietro", 2019,
7, "Pietro", 2019,
8, "Diego", 2019)
nb_collabs <- dta %>%
left_join(dta, by = c("ID", "Year")) %>%
select(-ID) %>%
group_by(Name.x, Year) %>%
nest(collaborators = Name.y) %>%
mutate(unique_collaborators = map(collaborators, distinct),
Unique_Coll = map_int(unique_collaborators, nrow)) %>%
select(-collaborators, -unique_collaborators)
left_join(dta, nb_collabs, by = c("Name"="Name.x", "Year"))
# A tibble: 18 x 4
# ID Name Year Unique_Coll
# <dbl> <chr> <dbl> <int>
# 1 1 Anas 2018 3
# 2 1 Carl 2018 6
# 3 1 Catherine 2018 4
# 4 2 Anas 2018 3
# 5 2 Carl 2018 6
# 6 3 Catherine 2018 4
# 7 3 Julien 2018 2
# 8 4 Raul 2018 4
# 9 4 Ahmed 2018 4
#10 4 Laurence 2018 4
#11 4 Carl 2018 6
#12 5 Anas 2019 4
#13 5 Georges 2019 3
#14 5 Arman 2019 3
#15 6 Anas 2019 4
#16 6 Pietro 2019 2
#17 7 Pietro 2019 2
#18 8 Diego 2019 1
The first step is to join the data with itself, so that each row has the person in "Name.x" and one collaborator in "Name.y". Nesting the collaborator names then gives one row per Name and Year, each holding a nested data frame of collaborators; removing the duplicates and counting the rows gives the number of unique collaborators.
nb_collabs now holds each person and their number of collaborators, so joining it back to the original data frame gives the desired format.
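The counting rule itself can be written out directly, which may help when checking the join-based answers. This is a minimal base R sketch, not a replacement for them; the helper name count_collab is hypothetical. It loops over rows, so it is only suitable for small data.

```r
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 3, 3, 4, 4, 4, 4, 5, 5, 5, 6, 6, 7, 8),
  Name = c("Anas", "Carl", "Catherine", "Anas", "Carl", "Catherine",
           "Julien", "Raul", "Ahmed", "Laurence", "Carl", "Anas",
           "Georges", "Arman", "Anas", "Pietro", "Pietro", "Diego"),
  Year = rep(c(2018, 2019), times = c(11, 7))
)

# Hypothetical helper: unique people sharing any project with `name` in `year`
count_collab <- function(name, year) {
  # Projects this person took part in during the given year
  ids <- df$ID[df$Name == name & df$Year == year]
  # Everyone on those projects that year, the person included
  length(unique(df$Name[df$ID %in% ids & df$Year == year]))
}

df$Unique_Coll <- mapply(count_collab, df$Name, df$Year, USE.NAMES = FALSE)
```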

Performing a dplyr full_join without a common variable to blend data frames

Using the dplyr full_join() operation, I am trying to perform the equivalent of a basic merge() operation in which no common variables exist (unable to satisfy the "by=" argument). This will blend two data frames and return all possible combinations.
However, the current full_join() function requires a common variable. I am unable to locate another dplyr function that can help with this. How can I perform this operation using functions specific to the dplyr library?
df_a = data.frame(department=c(1,2,3,4))
df_b = data.frame(period=c(2014,2015,2016,2017))
#This works as desired
big_df = merge(df_a,df_b)
#I'd like to perform the following in a much bigger operation:
big_df = dplyr::full_join(df_a,df_b)
#Error: No common variables. Please specify `by` param.
You can use crossing from tidyr:
library(tidyr)
crossing(df_a, df_b)
department period
1 1 2014
2 1 2015
3 1 2016
4 1 2017
5 2 2014
6 2 2015
7 2 2016
8 2 2017
9 3 2014
10 3 2015
11 3 2016
12 3 2017
13 4 2014
14 4 2015
15 4 2016
16 4 2017
If there are duplicate rows, crossing doesn't give the same result as merge.
Instead use full_join with by = character() to perform a cross-join which generates all combinations of df_a and df_b.
library("tidyverse") # version 1.3.2
# Add duplicate rows for illustration.
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))
merge doesn't de-duplicate.
df_a_merge_b <- merge(df_a, df_b)
df_a_merge_b
#> department period
#> 1 1 2014
#> 2 2 2014
#> 3 3 2014
#> 4 3 2014
#> 5 1 2015
#> 6 2 2015
#> 7 3 2015
#> 8 3 2015
#> 9 1 2016
#> 10 2 2016
#> 11 3 2016
#> 12 3 2016
#> 13 1 2017
#> 14 2 2017
#> 15 3 2017
#> 16 3 2017
crossing drops duplicate rows.
df_a_crossing_b <- crossing(df_a, df_b)
df_a_crossing_b
#> # A tibble: 12 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
full_join doesn't remove duplicates either.
df_a_full_join_b <- full_join(df_a, df_b, by = character())
df_a_full_join_b
#> # A tibble: 16 × 2
#> department period
#> <dbl> <dbl>
#> 1 1 2014
#> 2 1 2015
#> 3 1 2016
#> 4 1 2017
#> 5 2 2014
#> 6 2 2015
#> 7 2 2016
#> 8 2 2017
#> 9 3 2014
#> 10 3 2015
#> 11 3 2016
#> 12 3 2017
#> 13 3 2014
#> 14 3 2015
#> 15 3 2016
#> 16 3 2017
packageVersion("tidyverse")
#> [1] '1.3.2'
Created on 2023-01-13 with reprex v2.0.2
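The reprex above was run with tidyverse 1.3.2. From dplyr 1.1.0 onward, by = character() is deprecated in favour of the dedicated cross_join() function. A minimal sketch, assuming dplyr 1.1.0 or later is installed:

```r
library(dplyr)

# Same inputs as above, including the duplicated department 3
df_a <- tibble(department = c(1, 2, 3, 3))
df_b <- tibble(period = c(2014, 2015, 2016, 2017))

# Every row of df_a paired with every row of df_b; duplicates are kept
df_a_cross_b <- cross_join(df_a, df_b)
```

Like merge, this returns all 16 combinations rather than the 12 de-duplicated rows that crossing gives.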
