Extract all possible combinations of rows with unique values in a variable - r

I am trying to perform a meta-analysis on a dataset in which multiple authors have multiple studies which might cause bias. Therefore, I want to extract all the possible combinations of rows, in which any Author appears once.
Sample data:
sample <- data.frame(Author = c('a','a','b','b','c'),
Year = c('2020','2016', '2020','2010','2005'),
Value = c(3,1,2,4,5),
UniqueName = c('a 2020', 'a 2016', 'b 2020', 'b 2010', 'c 2005'))
Sample:
Author Year Value UniqueName
1 a 2020 3 a 2020
2 a 2016 1 a 2016
3 b 2020 2 b 2020
4 b 2010 4 b 2010
5 c 2005 5 c 2005
And would like to extract all possible combinations of rows (in this case, 4 possibilities) where each Author appears once.
> output1
Author Year Value UniqueName
1 a 2020 3 a 2020
2 b 2020 2 b 2020
3 c 2005 5 c 2005
> output2
Author Year Value UniqueName
1 a 2016 1 a 2016
2 b 2020 2 b 2020
3 c 2005 5 c 2005
> output3
Author Year Value UniqueName
1 a 2016 1 a 2016
2 b 2010 4 b 2010
3 c 2005 5 c 2005
> output4
Author Year Value UniqueName
1 a 2020 3 a 2020
2 b 2010 4 b 2010
3 c 2005 5 c 2005
At the end, I will perform the analyses on these 4 different extracted dataframes, but I don't know how to get them in a less manual way.

Maybe a less hacky way exists, but I seem to have a working solution.
My idea was to split your dataframe on authors and brute force the combinations of unique rows with expand.grid. Then with lapply creating a list of data.frames with the indexes of rows.
Here is the code:
splitsample <- split(sample, sample$Author)
outputs_rows <- expand.grid(lapply(splitsample, \(x) seq_len(nrow(x))))
names_authors <- colnames(outputs_rows)
outputs <- lapply(seq_len(nrow(outputs_rows)),
function(row) {
df <- data.frame()
for (aut in names_authors) {
df <- rbind(df, splitsample[[aut]][outputs_rows[row, aut], ])
}
return(df)
})
outputs
And the result looks like this:
> outputs
[[1]]
Author Year Value UniqueName
1 a 2020 3 a 2020
3 b 2020 2 b 2020
5 c 2005 5 c 2005
[[2]]
Author Year Value UniqueName
2 a 2016 1 a 2016
3 b 2020 2 b 2020
5 c 2005 5 c 2005
[[3]]
Author Year Value UniqueName
1 a 2020 3 a 2020
4 b 2010 4 b 2010
5 c 2005 5 c 2005
[[4]]
Author Year Value UniqueName
2 a 2016 1 a 2016
4 b 2010 4 b 2010
5 c 2005 5 c 2005
I hope this helped you.

Related

Create new dataframe from existing dataframe based on unique column values in R

So, I have a data frame that looks like this:
plot <- data.frame(plot=c("A", "A", "A", "B", "B", "C", "C", "C"),
grid= c(1,1,1,2,2,3,3,3),
year1=c(2000,2000,2010,2000,2010,2000,2010,2010),
year2=c(2005,2005,2015,2005,2015,2005,2015,2015))
plot
plot grid year1 year2
1 A 1 2000 2005
2 A 1 2000 2005
3 A 1 2010 2015
4 B 2 2000 2005
5 B 2 2010 2015
6 C 3 2000 2005
7 C 3 2010 2015
8 C 3 2010 2015
So for the plot column I have repeated values, the grid is always unique for each of the plots but the years are changing, what I want basically is a new data frame which will just keep all the unique combinations from these four columns, which would look like this:
plot grid year1 year2
1 A 1 2000 2005
2 A 1 2010 2015
3 B 2 2000 2005
4 B 2 2010 2015
5 C 3 2000 2005
6 C 3 2010 2015
I tried to look for solution but I could not fine anything that fits to my example.
Use distinct:
library(dplyr)
distinct(plot)
plot grid year1 year2
1 A 1 2000 2005
2 A 1 2010 2015
3 B 2 2000 2005
4 B 2 2010 2015
5 C 3 2000 2005
6 C 3 2010 2015
Or in base R, with duplicated:
plot[!duplicated(plot),]
data.table option using unique:
library(data.table)
unique(setDT(plot))
#> plot grid year1 year2
#> 1: A 1 2000 2005
#> 2: A 1 2010 2015
#> 3: B 2 2000 2005
#> 4: B 2 2010 2015
#> 5: C 3 2000 2005
#> 6: C 3 2010 2015
Created on 2022-11-04 with reprex v2.0.2

Create a new column with max values using the identifier column within a pipeline

I am trying to clean up some old code and convert over to "tidy". I am trying to create a new column of data within a pipeline that is the maximum age of individual fish. Let's represent the columns of interest as:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
# which looks like this:
fish_1
year fishid agei
1 2012 a 1
2 2012 a 2
3 2015 b 1
4 2015 b 2
5 2015 b 3
6 2013 c 1
7 2013 c 2
8 2013 c 3
9 2013 c 4
10 2012 d 1
11 2012 d 2
12 2015 e 1
13 2015 e 2
14 2015 e 3
What I'm trying to do is create a new column agec that is the maximum age for each individual fish repeated however many number of times is required to fill the rows for each fish.
The desired output would be:
fish_2 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3),
agec = c(2,2,3,3,3,4,4,4,4,2,2,3,3,3))
# Which looks like:
fish_2
year fishid agei agec
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
The way I had done this in the past was to use a plyr::ddply() call to create a new dataframe and then merge with fish like this:
caps = plyr::ddply(fish_1, c('fishid'), plyr::summarize, agec=max(agei))
fish = merge(fish_1, caps, by='fishid')
fish
fishid year agei agec
1 a 2012 1 2
2 a 2012 2 2
3 b 2015 1 3
4 b 2015 2 3
5 b 2015 3 3
6 c 2013 1 4
7 c 2013 2 4
8 c 2013 3 4
9 c 2013 4 4
10 d 2012 1 2
11 d 2012 2 2
12 e 2015 1 3
13 e 2015 2 3
14 e 2015 3 3
I'm hoping someone can help me achieve this data structure concisely within a pipeline. All of the similar questions I have found have been very verbose and not specific to this issue. I am new to using tidyverse but I'm having trouble getting the group_by() function (to replace the ddply() call) within a pipe, and I'm hoping there is a simpler way.
UPDATE
For those interested it appears both answers below are correct. The reason that I struggled was because I was already completing other data manipulations within my pipeline and I tried to complete the formation of the agec column within a previous call to dplyr::mutate(). You can refer to my comment on #Thomas answer to see the error in my ways. Hope this helps.
Try dplyr instead of plyr
library(dplyr)
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
You can use group_by from dplyr to group your fish IDs and then simply call mutate (dplyr as well) with max:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
# A tibble: 14 x 4
# Groups: fishid [5]
year fishid agei agec
<dbl> <chr> <dbl> <dbl>
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
An option with data.table
library(data.table)
setDT(fish_1)[, agec := max(agei, na.rm = TRUE), fishid]

Create groups based on time period

How can I create a new grouping variable for my data based on 5-year steps?
So from this:
group <- c(rep("A", 7), rep("B", 10))
year <- c(2008:2014, 2005:2014)
dat <- data.frame(group, year)
group year
1 A 2008
2 A 2009
3 A 2010
4 A 2011
5 A 2012
6 A 2013
7 A 2014
8 B 2005
9 B 2006
10 B 2007
11 B 2008
12 B 2009
13 B 2010
14 B 2011
15 B 2012
16 B 2013
17 B 2014
To this:
> dat
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014
7 A 2014 2010_2014
8 B 2005 2005_2009
9 B 2006 2005_2009
10 B 2007 2005_2009
11 B 2008 2005_2009
12 B 2009 2005_2009
13 B 2010 2010_2014
14 B 2011 2010_2014
15 B 2012 2010_2014
16 B 2013 2010_2014
17 B 2014 2010_2014
I guess I could use cut(dat$year, breaks = ??) but I don't know how to set the breaks.
Here is one way of doing it:
dat$period <- paste(min <- floor(dat$year/5)*5, min+4,sep = "_")
I guess the trick here is to get the biggest whole number smaller than your year with the floor(year/x)*x function.
Here is a version that should work generally:
x <- 5
yearstart <- 2000
dat$period <- paste(min <- floor((dat$year-yearstart)/x)*x+yearstart,
min+x-1,sep = "_")
You can use yearstart to ensure e.g. year 2000 is the first in a group for when x is not a multiple of it.
cut should do the job if you create actual Date objects from your 'year' column.
## convert 'year' column to dates
yrs <- paste0(dat$year, "-01-01")
yrs <- as.Date(yrs)
## create cuts of 5 years and add them to data.frame
dat$period <- cut(yrs, "5 years")
## create desired factor levels
library(lubridate)
lvl <- as.Date(levels(dat$period))
lvl <- paste(year(lvl), year(lvl) + 4, sep = "_")
levels(dat$period) <- lvl
head(dat)
group year period
1 A 2008 2005_2009
2 A 2009 2005_2009
3 A 2010 2010_2014
4 A 2011 2010_2014
5 A 2012 2010_2014
6 A 2013 2010_2014

Merge 2 resulting vectors into 1 data frame using R

I have a df like this
Month <- c('JAN','JAN','JAN','JAN','FEB','FEB','MAR','APR','MAY','MAY')
Category <- c('A','A','B','C','A','E','B','D','E','F')
Year <- c(2014,2015,2015,2015,2014,2013,2015,2014,2015,2013)
Number_Combinations <- c(3,2,3,4,1,3,6,5,1,1)
df <- data.frame(Month ,Category,Year,Number_Combinations)
df
Month Category Year Number_Combinations
1 JAN A 2014 3
2 JAN A 2015 2
3 JAN B 2015 3
4 JAN C 2015 4
5 FEB A 2014 1
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
9 MAY E 2015 1
10 MAY F 2013 1
I have another df that I got from the above dataframe with a condition
df1 <- subset(df,Number_Combinations > 2)
df1
Month Category Year Number_Combinations
1 JAN A 2014 3
3 JAN B 2015 3
4 JAN C 2015 4
6 FEB E 2013 3
7 MAR B 2015 6
8 APR D 2014 5
Now I want to create a table reporting the month, the total number of rows for the month in df and the total number of for the month in df1
Desired Output would be
Month Number_Month_df Number_Month_df1
1 JAN 4 3
2 FEB 2 1
3 MAR 1 1
4 APR 1 1
5 MAY 2 0
While I used table(df) and table(df1) and tried merging but not getting the desired result. Could someone please help me in getting the above dataframe?
We get the table of the 'Month' column from both 'df' and 'df1', convert to 'data.frame' (as.data.frame), merge by the 'Var1', and change the column names accordingly.
res <- merge(as.data.frame(table(df$Month)),
as.data.frame(table(df1$Month)), by='Var1')
colnames(res) <- c('Month', 'Number_Month_df', 'Number_Month_df1')
res <- data.frame(Number_Month_df=sort(table(df$Month),T),
Number_Month_df1=sort(table(df1$Month),T))
res$Month <- rownames(res)

repeat rows in a dataset based on a column, but increment the rows [duplicate]

This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 5 years ago.
I have a dataset which has project name, start year and contract term. I need to develop this dataset into time series. For example, one row in my dataset is: Project A, start year 2003 and contract term 5. I would like to repeat each row based on contract term. My dataset looks like this:
Project Name Start Year Contract Term
A 2003 5
B 2013 3
C 2000 2
My desired result should look like this:
Project Name Start Year Contract Term
A 2003 5
A 2004 5
A 2005 5
A 2006 5
A 2007 5
B 2013 3
B 2014 3
B 2014 3
C 2000 2
C 2001 2
I have tried:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
But this only repeats each project by the number in contract term. I can not make it to increment the years.
Thanks in advance!
Here it is in two steps:
Step 1, you know:
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2003 5
# 1.1 A 2003 5
# 1.2 A 2003 5
# 1.3 A 2003 5
# 1.4 A 2003 5
# 2 B 2013 3
# 2.1 B 2013 3
# 2.2 B 2013 3
# 3 C 2000 2
# 3.1 C 2000 2
Step 2 makes use of sequence and basic addition:
sequence(rpsInput$Contract.Term) ## This will be helpful...
# [1] 1 2 3 4 5 1 2 3 1 2
rpsData$Start.Year <- rpsData$Start.Year + sequence(rpsInput$Contract.Term)
rpsData
# Project.Name Start.Year Contract.Term
# 1 A 2004 5
# 1.1 A 2005 5
# 1.2 A 2006 5
# 1.3 A 2007 5
# 1.4 A 2008 5
# 2 B 2014 3
# 2.1 B 2015 3
# 2.2 B 2016 3
# 3 C 2001 2
# 3.1 C 2002 2
Just to piggy back on Ananda's answer, change
sequence(rpsInput$Contract.Term)
to
(sequence(rpsInput$Contract.Term)-1)
to get the output you desire.
ProjectName<-c("A","B","C")
Start.Year<-c(2003,2013,2000)
Contract.Term<-c(5,3,2)
rpsInput<-data.frame(ProjectName,Start.Year,Contract.Term)
rpsData <- rpsInput[rep(rownames(rpsInput), rpsInput$Contract.Term), ]
rpsData$Start.Year <- rpsData$Start.Year + (sequence(rpsInput$Contract.Term)-1)
rpsData
# ProjectName Start.Year Contract.Term
#1 A 2003 5
#1.1 A 2004 5
#1.2 A 2005 5
#1.3 A 2006 5
#1.4 A 2007 5
#2 B 2013 3
#2.1 B 2014 3
#2.2 B 2015 3
#3 C 2000 2
#3.1 C 2001 2

Resources