Condition on grouped observation in a panel dataframe - r

I have a dataframe that looks like this
id year changetype
1 2010 1
1 2012 2
2 2014 2
2 2014 2
3 2012 1
3 2012 2
3 2014 2
3 2014 1
I want to get something like this
id year changetype
1 2010 1
1 2012 2
2 2014 2
2 2014 2
In other words, I want to remove all observations associated with id 3 because, in the same year (2012), id=3 has both changetype=1 and changetype=2.
How can I impose a condition on a variable for observations grouped by id and year?
Many thanks to everyone helping me.

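You can do this with the data.table package:
library(data.table)
setDT(dt)
# number of distinct changetype values within each id-year pair
dt[, count := uniqueN(changetype), by = .(id, year)]
# keep an id only if none of its years mixes changetypes
dt[, keep := all(count == 1L), by = id][keep == TRUE, .(id, year, changetype)]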
id year changetype
1: 1 2010 1
2: 1 2012 2
3: 2 2014 2
4: 2 2014 2
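For comparison, the same rule can be sketched in base R with ave(). This is only an illustration, assuming the data sit in a plain data.frame named dt as in the question; n_types, bad_ids and result are names made up for the example.

```r
# Example data copied from the question
dt <- data.frame(
  id         = c(1, 1, 2, 2, 3, 3, 3, 3),
  year       = c(2010, 2012, 2014, 2014, 2012, 2012, 2014, 2014),
  changetype = c(1, 2, 2, 2, 1, 2, 2, 1)
)

# Number of distinct changetype values within each (id, year) pair,
# repeated on every row of that pair
n_types <- ave(dt$changetype, dt$id, dt$year,
               FUN = function(x) length(unique(x)))

# ids that mix changetypes in at least one year
bad_ids <- unique(dt$id[n_types > 1])

# Drop every observation belonging to those ids
result <- dt[!dt$id %in% bad_ids, ]
result
```

The condition is evaluated per id and year, but the removal happens per id, which is why the two steps are kept separate.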

Related

Extract all possible combinations of rows with unique values in a variable

I am trying to perform a meta-analysis on a dataset in which several authors each have multiple studies, which might cause bias. Therefore, I want to extract all the possible combinations of rows in which each Author appears exactly once.
Sample data:
sample <- data.frame(Author = c('a','a','b','b','c'),
Year = c('2020','2016', '2020','2010','2005'),
Value = c(3,1,2,4,5),
UniqueName = c('a 2020', 'a 2016', 'b 2020', 'b 2010', 'c 2005'))
Sample:
Author Year Value UniqueName
1 a 2020 3 a 2020
2 a 2016 1 a 2016
3 b 2020 2 b 2020
4 b 2010 4 b 2010
5 c 2005 5 c 2005
I would like to extract all possible combinations of rows (in this case, 4 possibilities) where each Author appears once.
> output1
Author Year Value UniqueName
1 a 2020 3 a 2020
2 b 2020 2 b 2020
3 c 2005 5 c 2005
> output2
Author Year Value UniqueName
1 a 2016 1 a 2016
2 b 2020 2 b 2020
3 c 2005 5 c 2005
> output3
Author Year Value UniqueName
1 a 2016 1 a 2016
2 b 2010 4 b 2010
3 c 2005 5 c 2005
> output4
Author Year Value UniqueName
1 a 2020 3 a 2020
2 b 2010 4 b 2010
3 c 2005 5 c 2005
At the end, I will perform the analyses on these 4 different extracted dataframes, but I don't know how to get them in a less manual way.
Maybe a less hacky way exists, but I seem to have a working solution.
My idea was to split your dataframe by author and brute-force the combinations of row indexes with expand.grid, then use lapply to build a list of data.frames from those row indexes.
Here is the code:
splitsample <- split(sample, sample$Author)
outputs_rows <- expand.grid(lapply(splitsample, \(x) seq_len(nrow(x))))
names_authors <- colnames(outputs_rows)
outputs <- lapply(seq_len(nrow(outputs_rows)),
function(row) {
df <- data.frame()
for (aut in names_authors) {
df <- rbind(df, splitsample[[aut]][outputs_rows[row, aut], ])
}
return(df)
})
outputs
And the result looks like this:
> outputs
[[1]]
Author Year Value UniqueName
1 a 2020 3 a 2020
3 b 2020 2 b 2020
5 c 2005 5 c 2005
[[2]]
Author Year Value UniqueName
2 a 2016 1 a 2016
3 b 2020 2 b 2020
5 c 2005 5 c 2005
[[3]]
Author Year Value UniqueName
1 a 2020 3 a 2020
4 b 2010 4 b 2010
5 c 2005 5 c 2005
[[4]]
Author Year Value UniqueName
2 a 2016 1 a 2016
4 b 2010 4 b 2010
5 c 2005 5 c 2005
I hope this helped you.
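A more compact variant of the same idea is sketched below. It is not a different algorithm, just a shorter way to write it, assuming it is acceptable to work with row numbers of the original data frame; rows_by_author and combos are names invented for this example.

```r
# Sample data copied from the question
sample <- data.frame(Author = c('a', 'a', 'b', 'b', 'c'),
                     Year = c('2020', '2016', '2020', '2010', '2005'),
                     Value = c(3, 1, 2, 4, 5),
                     UniqueName = c('a 2020', 'a 2016', 'b 2020', 'b 2010', 'c 2005'))

# Row numbers of the original data frame, grouped by author
rows_by_author <- split(seq_len(nrow(sample)), sample$Author)

# One row per combination, one column per author, holding row numbers
combos <- expand.grid(rows_by_author)

# Build one data frame per combination by indexing the original rows
outputs <- lapply(seq_len(nrow(combos)),
                  function(i) sample[unlist(combos[i, ]), ])
length(outputs)
```

Because the row numbers refer to the original data frame, no rbind loop is needed. The number of combinations is the product of the number of studies per author, so it grows quickly with many multi-study authors.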

Create a new column with max values using the identifier column within a pipeline

I am trying to clean up some old code and convert over to "tidy". I am trying to create a new column of data within a pipeline that is the maximum age of individual fish. Let's represent the columns of interest as:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
# which looks like this:
fish_1
year fishid agei
1 2012 a 1
2 2012 a 2
3 2015 b 1
4 2015 b 2
5 2015 b 3
6 2013 c 1
7 2013 c 2
8 2013 c 3
9 2013 c 4
10 2012 d 1
11 2012 d 2
12 2015 e 1
13 2015 e 2
14 2015 e 3
What I'm trying to do is create a new column agec that holds the maximum age for each individual fish, repeated on every row belonging to that fish.
The desired output would be:
fish_2 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3),
agec = c(2,2,3,3,3,4,4,4,4,2,2,3,3,3))
# Which looks like:
fish_2
year fishid agei agec
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
The way I had done this in the past was to use a plyr::ddply() call to create a new dataframe and then merge it with fish_1 like this:
caps = plyr::ddply(fish_1, c('fishid'), plyr::summarize, agec=max(agei))
fish = merge(fish_1, caps, by='fishid')
fish
fishid year agei agec
1 a 2012 1 2
2 a 2012 2 2
3 b 2015 1 3
4 b 2015 2 3
5 b 2015 3 3
6 c 2013 1 4
7 c 2013 2 4
8 c 2013 3 4
9 c 2013 4 4
10 d 2012 1 2
11 d 2012 2 2
12 e 2015 1 3
13 e 2015 2 3
14 e 2015 3 3
I'm hoping someone can help me achieve this data structure concisely within a pipeline. All of the similar questions I have found have been very verbose and not specific to this issue. I am new to the tidyverse and I'm having trouble using the group_by() function (to replace the ddply() call) within a pipe, and I'm hoping there is a simpler way.
UPDATE
For those interested, it appears both answers below are correct. The reason I struggled was that I was already doing other data manipulations within my pipeline, and I tried to create the agec column inside an earlier call to dplyr::mutate(). You can refer to my comment on @Thomas's answer to see the error of my ways. Hope this helps.
Try dplyr instead of plyr
library(dplyr)
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
You can use group_by from dplyr to group your fish IDs and then simply call mutate (dplyr as well) with max:
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'), # unique identifier for each fish
agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))
fish_1 %>%
group_by(fishid) %>%
mutate(agec = max(agei))
# A tibble: 14 x 4
# Groups: fishid [5]
year fishid agei agec
<dbl> <chr> <dbl> <dbl>
1 2012 a 1 2
2 2012 a 2 2
3 2015 b 1 3
4 2015 b 2 3
5 2015 b 3 3
6 2013 c 1 4
7 2013 c 2 4
8 2013 c 3 4
9 2013 c 4 4
10 2012 d 1 2
11 2012 d 2 2
12 2015 e 1 3
13 2015 e 2 3
14 2015 e 3 3
An option with data.table
library(data.table)
setDT(fish_1)[, agec := max(agei, na.rm = TRUE), fishid]
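If no extra package is wanted, a base R sketch with ave() gives the same column. This is offered as an alternative to the answers above, not as a replacement, and it assumes fish_1 is a plain data.frame as defined in the question.

```r
# Data copied from the question
fish_1 <- data.frame(year = c(2012,2012,2015,2015,2015,2013,2013,2013,2013,2012,2012,2015,2015,2015),
                     fishid = c('a','a','b','b','b','c','c','c','c','d','d','e','e','e'),
                     agei = c(1,2,1,2,3,1,2,3,4,1,2,1,2,3))

# ave() applies FUN within each fishid group and returns a vector as long
# as the input, so the group maximum is repeated on every row of the group
fish_1$agec <- ave(fish_1$agei, fish_1$fishid, FUN = max)
fish_1
```

Like group_by() followed by mutate(), ave() keeps the original row order and number of rows, so no merge step is needed.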

data.table count of rows indexing by category

I'm stuck on an easy one, but didn't find a solution in either the data.table manual or around here.
dt<-data.table(account=c("treu65","treu65","treg23","treg23","treg23"),year=c("2012","2013","2013","2013","2012"))
I need to add a column with a count of rows by account and year. The problem is that I need to create two separate columns. One will contain the count for 2012, the other for 2013.
Like so:
account year count2012 count2013
1: treu65 2012 1 1
2: treu65 2013 1 1
3: treg23 2013 1 2
4: treg23 2013 1 2
5: treg23 2012 1 2
Normally I would aggregate, but in this case I need the same structure as above.
I got as far as:
dt[year==2012,count2012:=.N,.(account)]
dt[year==2013,count2013:=.N,.(account)]
But I have NAs now:
account year count2012 count2013
1: treu65 2012 1 NA
2: treu65 2013 NA 1
3: treg23 2013 NA 2
4: treg23 2013 NA 2
5: treg23 2012 1 NA
And I should get:
account year count2012 count2013
1: treu65 2012 1 1
2: treu65 2013 1 1
3: treg23 2013 1 2
4: treg23 2013 1 2
5: treg23 2012 1 2
Thank you.
You can move the filter from the i position (where it restricts the assignment to the matching rows only, leaving NA elsewhere) to the j position and use sum to count the rows:
dt[, `:=`(count2012 = sum(year == 2012), count2013 = sum(year == 2013)), .(account)][]
# account year count2012 count2013
#1: treu65 2012 1 1
#2: treu65 2013 1 1
#3: treg23 2013 1 2
#4: treg23 2013 1 2
#5: treg23 2012 1 2
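The same counting idea (sum a logical test within each account) can be generalised to any number of years. The sketch below uses base R only, on a plain data.frame copy of the question's data, and the loop that builds the column names is an assumption about what a general version might look like rather than part of the answer above.

```r
# Data copied from the question, as a plain data.frame
dt <- data.frame(account = c("treu65", "treu65", "treg23", "treg23", "treg23"),
                 year = c("2012", "2013", "2013", "2013", "2012"))

# For each distinct year, count the matching rows per account and
# repeat that count on every row of the account
for (y in sort(unique(dt$year))) {
  dt[[paste0("count", y)]] <- ave(as.integer(dt$year == y), dt$account, FUN = sum)
}
dt
```

Since the test is evaluated on all rows of each account rather than on a filtered subset, every row receives a value and no NA appears.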

Create a new variable to epidemiological week

I have a data frame with a week column and a year column (87 weeks in total). I need to create a new column (weekseq) with a number that identifies each week sequentially from first to last. I don't know how to do this. Can someone help me?
Example:
id week month year yearweek weekseq
1 1 1 2014 2014/1
1 1 1 2013 2013/1
1 2 1 2014 2014/2
1 2 1 2013 2013/2
1 3 1 2014 2014/3
1 3 1 2013 2013/3
1 4 1 2014 2014/4
1 4 1 2013 2013/4
1 5 1 2014 2014/5
1 5 1 2013 2013/5
1 6 2 2014 2014/6
1 6 2 2013 2013/6
1 7 2 2014 2014/7
1 7 2 2013 2013/7
1 8 2 2014 2014/8
1 8 2 2013 2013/8
1 9 2 2014 2014/9
1 9 2 2013 2013/9
1 10 3 2014 2014/10
1 10 3 2013 2013/10
1 11 3 2014 2014/11
1 11 3 2013 2013/11
1 12 3 2014 2014/12
1 12 3 2013 2013/12
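This solution requires the 'dplyr' and 'plyr' packages:
library(plyr)   # load plyr before dplyr so that dplyr's verbs are not masked
library(dplyr)
# Coerce into a tibble (as_tibble() replaces the deprecated tbl_df())
datatbl <- as_tibble(data)
# Arrange, giving more weight to year than week
datatbl <- dplyr::arrange(datatbl, year, month, week)
# Create a new column that numbers the arranged rows sequentially within each id
seqtbl <- ddply(datatbl, .(id), transform, weekseq = seq_along(id))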
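Numbering rows gives the week sequence only when each id has exactly one row per week. If several rows can share a week, a sketch like the following numbers the distinct year/week pairs instead. It uses base R, a small made-up subset of the question's data, and a helper column named key that is purely illustrative.

```r
# A small subset shaped like the question's data (weeks 1 to 3 of 2013 and 2014)
data <- data.frame(id = 1,
                   week = rep(1:3, each = 2),
                   year = rep(c(2014, 2013), times = 3))

# Sortable key: year first, then week (assumes week is at most two digits)
key <- data$year * 100 + data$week

# Position of each row's key among the sorted distinct keys
data$weekseq <- match(key, sort(unique(key)))
data
```

Only weeks that actually occur in the data are counted, so a calendar week with no rows does not leave a gap in weekseq.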

expand.grid() based on values in two variables in R

I would like to expand a grid in R such that the expansion occurs for unique values of one variable but joint values for two variables. For example:
frame <- data.frame(id = seq(1:2),id2 = seq(1:2), year = c(2005, 2008))
I would like to expand the frame for each year, but such that id and id2 are considered jointly (e.g. (1,1) and (2,2)), to generate an output like:
id id2 year
1 1 2005
1 1 2006
1 1 2007
1 1 2005
2 2 2006
2 2 2007
2 2 2008
Using expand.grid(), does someone know how to do this? I have not been able to wrangle the code past looking at each id uniquely and producing a frame with all combinations given the following code:
with(frame, expand.grid(year = seq(min(year), max(year)), id = unique(id), id2 = unique(id2)))
Thanks for any and all help.
You could do this with reshape::expand.grid.df:
library(reshape)
expand.grid.df(data.frame(id=1:2,id2=1:2), data.frame(year=c(2005:2008)))
id id2 year
1 1 1 2005
2 2 2 2005
3 1 1 2006
4 2 2 2006
5 1 1 2007
6 2 2 2007
7 1 1 2008
8 2 2 2008
Here is another way using base R
indx <- diff(frame$year)+1
indx1 <- rep(1:nrow(frame), each=indx)
frame1 <- transform(frame[indx1,1:2], year=seq(frame$year[1], length.out=indx, by=1))
row.names(frame1) <- NULL
frame1
# id id2 year
#1 1 1 2005
#2 1 1 2006
#3 1 1 2007
#4 1 1 2008
#5 2 2 2005
#6 2 2 2006
#7 2 2 2007
#8 2 2 2008
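The indexing approach above relies on the frame having exactly two rows. A more general base R sketch is a cross join via merge() with by = NULL, which returns the Cartesian product of two data frames. It assumes every id pair should receive the full range from the smallest to the largest year; id_pairs, years and expanded are names chosen for the example.

```r
frame <- data.frame(id = 1:2, id2 = 1:2, year = c(2005, 2008))

# Distinct (id, id2) pairs, treated jointly
id_pairs <- unique(frame[c("id", "id2")])

# Every year between the smallest and largest year in the frame
years <- data.frame(year = seq(min(frame$year), max(frame$year)))

# by = NULL makes merge() return all combinations of the two data frames
expanded <- merge(id_pairs, years, by = NULL)

# Sort so that each pair's years are listed together
expanded <- expanded[order(expanded$id, expanded$id2, expanded$year), ]
row.names(expanded) <- NULL
expanded
```

Because the pairs are kept in one data frame, id and id2 never get combined independently, which is what the plain expand.grid() call in the question ends up doing.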

Resources