Fill observation from latest year in grouped data using data.table - r

For each id, I am trying to fill the value in the code column corresponding to the latest year using data.table.
Data:
df <- data.frame(
id=c(1,1,1,2,2,2,3,3,3),
year=c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016),
code=c(1,2,2, 1,2,3, 3,4,5)
)
> df
id year code
1 1 2014 1
2 1 2015 2
3 1 2016 2
4 2 2015 1
5 2 2015 2
6 2 2016 3
7 3 NA 3
8 3 NA 4
9 3 2016 5
In dplyr:
df %>% group_by(id) %>%
mutate(code2=last(na.omit(code[order(year, na.last=F)])))
# A tibble: 9 x 4
# Groups: id [3]
id year code code2
<dbl> <dbl> <dbl> <dbl>
1 1 2014 1 2
2 1 2015 2 2
3 1 2016 2 2
4 2 2015 1 3
5 2 2015 2 3
6 2 2016 3 3
7 3 NA 3 5
8 3 NA 4 5
9 3 2016 5 5
Attempt in data.table:
df %>%
as.data.table() %>%
.[,code2:=last(na.omit(code[order(year, na.last=F)]), by=id)] %>%
as.data.table()

With data.table, take the code from the row with the maximum year within each id:
> setDT(df)[,code2:=code[which.max(year)],id][]
id year code code2
1: 1 2014 1 2
2: 1 2015 2 2
3: 1 2016 2 2
4: 2 2015 1 3
5: 2 2015 2 3
6: 2 2016 3 3
7: 3 NA 3 5
8: 3 NA 4 5
9: 3 2016 5 5
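The attempt in the question can also be made to work. As a minimal sketch (assuming the problem is only the placement of `by`, which sits inside the `last(...)` call instead of being an argument of `[`), moving `by = id` out of the expression is enough. The name `dt` is illustrative.

```r
library(data.table)

# Same data as in the question, built directly as a data.table
dt <- data.table(
  id   = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  year = c(2014, 2015, 2016, 2015, 2015, 2016, NA, NA, 2016),
  code = c(1, 2, 2, 1, 2, 3, 3, 4, 5)
)

# `by = id` is an argument of `[`, not of last():
# sort codes by year (NA years first), drop NA codes, keep the last one
dt[, code2 := last(na.omit(code[order(year, na.last = FALSE)])), by = id]
dt
```

One difference to keep in mind: when several rows share the latest year, `which.max(year)` picks the first of them, while this version picks the last.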

Related

Keep levels based on multiple conditions on another column in r data frame

I have a data frame that looks like this:
spp year month count
1 2020 2 2
1 2020 2 3
1 2020 5 4
1 2020 5 3
1 2021 2 2
1 2021 2 4
2 2020 2 2
2 2020 2 6
2 2020 5 3
3 2021 2 4
3 2021 2 4
4 2020 2 3
4 2020 2 6
4 2020 5 5
4 2020 5 7
I only want to keep the species that 1) have at least two observations per month and 2) have observations in at least two different months. I want to end up with something like this:
spp year month count
1 2020 2 2
1 2020 2 3
1 2020 5 4
1 2020 5 3
1 2021 2 2
1 2021 2 4
4 2020 2 3
4 2020 2 6
4 2020 5 5
4 2020 5 7
I'm only working with two months in 2020 (2 and 5) and one month in 2021 (2). I think filter from the dplyr package might work, but I have no idea how to go about it.
Thanks in advance.
You can use the following code with two filters and two group_by:
library(dplyr)
df %>%
group_by(spp, month) %>%
filter(n() >= 2) %>%
group_by(spp) %>%
filter(n_distinct(month) >= 2) %>%
ungroup()
#> # A tibble: 10 × 4
#> spp year month count
#> <int> <int> <int> <int>
#> 1 1 2020 2 2
#> 2 1 2020 2 3
#> 3 1 2020 5 4
#> 4 1 2020 5 3
#> 5 1 2021 2 2
#> 6 1 2021 2 4
#> 7 4 2020 2 3
#> 8 4 2020 2 6
#> 9 4 2020 5 5
#> 10 4 2020 5 7
Created on 2022-08-27 with reprex v2.0.2
We may also do
library(dplyr)
df1 %>%
group_by(spp) %>%
add_count(month) %>%
filter(n>=2, n_distinct(month[n >=2]) >=2 ) %>%
ungroup %>%
select(-n)
-output
# A tibble: 10 × 4
spp year month count
<int> <int> <int> <int>
1 1 2020 2 2
2 1 2020 2 3
3 1 2020 5 4
4 1 2020 5 3
5 1 2021 2 2
6 1 2021 2 4
7 4 2020 2 3
8 4 2020 2 6
9 4 2020 5 5
10 4 2020 5 7
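For comparison, the same two-step logic can be sketched in base R with ave. This assumes, as the dplyr answers do, that "observations per month" is counted per spp and month regardless of year. The helper names `n_obs`, `step1`, `n_months` and `result` are illustrative.

```r
# Sample data from the question
df <- data.frame(
  spp   = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4),
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2020, 2020, 2020,
            2021, 2021, 2020, 2020, 2020, 2020),
  month = c(2, 2, 5, 5, 2, 2, 2, 2, 5, 2, 2, 2, 2, 5, 5),
  count = c(2, 3, 4, 3, 2, 4, 2, 6, 3, 4, 4, 3, 6, 5, 7)
)

# Step 1: keep (spp, month) combinations with at least two rows
n_obs <- ave(seq_len(nrow(df)), df$spp, df$month, FUN = length)
step1 <- df[n_obs >= 2, ]

# Step 2: keep species that still have at least two distinct months
n_months <- ave(step1$month, step1$spp,
                FUN = function(m) length(unique(m)))
result <- step1[n_months >= 2, ]
result
```

The order of the two steps matters: species 2 has two months in the raw data, but only one of them survives the first filter.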

Retaining unique values per individual id in a dataframe in R

A very basic question! I tried searching a lot and using my own brain, but eventually had to come here. :)
Well here is a sample dataframe
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.90,2.90,2.90,2.90,2.21,2.21,2.21,2.21))
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 2.75
3 1 3 2015 2.75
4 1 4 2015 2.75
5 2 1 2015 2.90
6 2 2 2015 2.90
7 2 3 2015 2.90
8 2 4 2015 2.90
9 3 1 2015 2.21
10 3 2 2015 2.21
11 3 3 2015 2.21
12 3 4 2015 2.21
I need unique value per id. So, I use this-
df$value[duplicated(df$value)]<-NA
And I get what I need.
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2015 2.90
6 2 2 2015 NA
7 2 3 2015 NA
8 2 4 2015 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
Now let's say that I have a new dataframe with more similar values:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21))
If I use the same code, I will end up with data missing for ID 2 as well.
How could I retain unique values for every ID per year??
Any help is much appreciated.
Here is a base R solution using ave + duplicated
df <- within(df,value <- ave(value,
id,
year,
FUN = function(v) ifelse(duplicated(v),NA,v)))
such that
> df
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2015 2.90
6 2 2 2015 NA
7 2 3 2015 NA
8 2 4 2015 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA
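The output above is for the first data frame. As a quick check, the same ave + duplicated call can be run on the second data frame from the question, where ids 1 and 2 share the value 2.75. The name `df2` is illustrative.

```r
# Second data frame from the question: ids 1 and 2 share the value 2.75
df2 <- data.frame(
  id      = c(1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  quarter = c(1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4),
  year    = c(2015, 2015, 2015, 2015, 2016, 2016, 2016, 2016,
              2015, 2015, 2015, 2015),
  value   = c(2.75, 2.75, 2.75, 2.75, 2.75, 2.75, 2.75, 2.75,
              2.21, 2.21, 2.21, 2.21)
)

# duplicated() is evaluated separately inside each id/year group,
# so the first 2.75 of id 2 is kept
df2$value <- ave(df2$value, df2$id, df2$year,
                 FUN = function(v) ifelse(duplicated(v), NA, v))
df2
```

Because duplication is checked within each group, a value that repeats across ids no longer causes an id to lose its only entry.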
Using duplicated on the combination of id and year (built with cbind) instead of on value should give you the desired result:
df[duplicated(cbind(df$id, df$year)), "value"]<-NA
Using this solution on your second data.frame that gave you missing rows:
df<- data.frame(id=c(1,1,1,1,2,2,2,2,3,3,3,3),
quarter=c(1,2,3,4,1,2,3,4,1,2,3,4),
year=c(2015,2015,2015,2015,2016,2016,2016,2016,2015,2015,2015,2015),
value=c(2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.75,2.21,2.21,2.21,2.21))
df[duplicated(cbind(df$id, df$year)), "value"]<-NA
Returns:
id quarter year value
1 1 1 2015 2.75
2 1 2 2015 NA
3 1 3 2015 NA
4 1 4 2015 NA
5 2 1 2016 2.75
6 2 2 2016 NA
7 2 3 2016 NA
8 2 4 2016 NA
9 3 1 2015 2.21
10 3 2 2015 NA
11 3 3 2015 NA
12 3 4 2015 NA

calculating cumulatives within a group correctly

I hope anyone can help with this. I have a data frame similar to this:
test <- data.frame(ID = c(1:24),
group = rep(c(1,1,1,1,1,1,2,2,2,2,2,2),2),
year1 = rep(c(2018,2018,2018,2019,2019,2019),4),
month1 = rep(c(1,2,3),8))
Now I want to do a cumsum per group, but when I use the following code the cumsum 'restarts' each year.
test2 <-test %>%
group_by(group,year1,month1) %>%
summarise(a = length(unique(ID))) %>%
mutate(a = cumsum(a))
My desired output is:
group year1 month1 a
1 1 2018 1 2
2 1 2018 2 4
3 1 2018 3 6
4 1 2019 1 8
5 1 2019 2 10
6 1 2019 3 12
7 2 2018 1 2
8 2 2018 2 4
9 2 2018 3 6
10 2 2019 1 8
11 2 2019 2 10
12 2 2019 3 12
You could first count unique ID for each group, month and year and then take cumsum of it for each group.
library(dplyr)
test %>%
group_by(group, year1, month1) %>%
summarise(a = n_distinct(ID)) %>%
group_by(group) %>%
mutate(a = cumsum(a))
# group year1 month1 a
# <dbl> <dbl> <dbl> <int>
# 1 1 2018 1 2
# 2 1 2018 2 4
# 3 1 2018 3 6
# 4 1 2019 1 8
# 5 1 2019 2 10
# 6 1 2019 3 12
# 7 2 2018 1 2
# 8 2 2018 2 4
# 9 2 2018 3 6
#10 2 2019 1 8
#11 2 2019 2 10
#12 2 2019 3 12
With data.table, this can be done with
library(data.table)
setDT(test)[, .(a = uniqueN(ID)), by = .(group, year1, month1)
][, a := cumsum(a), by = group]
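The restart in the original attempt comes from the grouping that is left after summarise(): by default it drops only the last grouping variable (month1), so the following cumsum still runs per group and year1. A base R sketch of the same count-then-accumulate idea, using aggregate and ave (the name `counts` is illustrative):

```r
test <- data.frame(ID = 1:24,
                   group = rep(c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2), 2),
                   year1 = rep(c(2018, 2018, 2018, 2019, 2019, 2019), 4),
                   month1 = rep(c(1, 2, 3), 8))

# Count distinct IDs per group/year/month
counts <- aggregate(ID ~ group + year1 + month1, data = test,
                    FUN = function(id) length(unique(id)))
names(counts)[names(counts) == "ID"] <- "a"

# Sort chronologically within group, then accumulate over group only
counts <- counts[order(counts$group, counts$year1, counts$month1), ]
counts$a <- ave(counts$a, counts$group, FUN = cumsum)
counts
```

The explicit sort matters here, since a cumulative sum depends on row order.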

How to get all my values within the same category to be equal in my dataframe?

So, I have a dataset that looks just like that :
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2
But I do not want to have NAs in the cat column. Instead, I want every line within the same site to get the same value of cat.
Just like this :
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA 1
3 10 2015 2.0 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0.0 2
8 11 2010 NA 2
9 11 2009 1.0 2
Any idea on how I can do that?
Use na.aggregate to fill in the NA values using ave to do it by site.
library(zoo)
transform(DF, cat = ave(cat, site, FUN = na.aggregate))
giving:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
Note
The input used, in reproducible form, is:
Lines <- "
site year territories cat
1 10 2017 0.0 1
2 10 2016 NA NA
3 10 2015 2.0 1
4 10 2014 NA NA
5 10 2013 NA NA
6 11 2012 NA NA
7 11 2011 0.0 2
8 11 2010 NA NA
9 11 2009 1.0 2"
DF <- read.table(text = Lines)
A complete base R alternative:
transform(DF, cat = ave(cat, site, FUN = function(x) x[!is.na(x)][1]))
which gives:
site year territories cat
1 10 2017 0 1
2 10 2016 NA 1
3 10 2015 2 1
4 10 2014 NA 1
5 10 2013 NA 1
6 11 2012 NA 2
7 11 2011 0 2
8 11 2010 NA 2
9 11 2009 1 2
The same logic implemented with dplyr:
library(dplyr)
DF %>%
group_by(site) %>%
mutate(cat = na.omit(cat)[1])
Or with na.locf of the zoo-package:
library(zoo)
transform(DF, cat = ave(cat, site, FUN = function(x) na.locf(na.locf(x, fromLast = TRUE, na.rm = FALSE))))
Or with fill from tidyr:
library(tidyr)
library(dplyr)
DF %>%
group_by(site) %>%
fill(cat) %>%
fill(cat, .direction = "up")
NOTE: I wonder what the added value of the cat column is when cat has to be the same for each site. You'll end up with two grouping variables that do exactly the same thing, which makes one of them redundant imo.
You can also use tidyr::fill
library(dplyr)
library(tidyr)
DF %>%
group_by(site) %>%
fill(cat,.direction = "up") %>%
fill(cat,.direction = "down") %>%
ungroup
# # A tibble: 9 x 4
# site year territories cat
# <int> <int> <dbl> <int>
# 1 10 2017 0 1
# 2 10 2016 NA 1
# 3 10 2015 2 1
# 4 10 2014 NA 1
# 5 10 2013 NA 1
# 6 11 2012 NA 2
# 7 11 2011 0 2
# 8 11 2010 NA 2
# 9 11 2009 1 2
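All of the approaches above assume that each site has exactly one non-NA cat value. A small base R sketch that checks this assumption before filling (the name `n_cats` is illustrative; the data frame is rebuilt here so the example is self-contained):

```r
DF <- data.frame(
  site        = c(10, 10, 10, 10, 10, 11, 11, 11, 11),
  year        = 2017:2009,
  territories = c(0, NA, 2, NA, NA, NA, 0, NA, 1),
  cat         = c(1, NA, 1, NA, NA, NA, 2, NA, 2)
)

# Number of distinct non-NA cat values per site; filling is only
# unambiguous when this is exactly 1 everywhere
n_cats <- tapply(DF$cat, DF$site,
                 function(x) length(unique(x[!is.na(x)])))
stopifnot(all(n_cats == 1))

# Fill every row with the site's first non-NA cat
DF$cat <- ave(DF$cat, DF$site, FUN = function(x) x[!is.na(x)][1])
DF
```

If a site had two different cat values, the fill-based answers would give different results depending on direction, so the check is worth running first.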

Add a "rank" column to a data frame

I have a dataframe with counts of different items, in different years:
df <- data.frame(item = rep(c('a','b','c'), 3),
year = rep(c('2010','2011','2012'), each=3),
count = c(1,4,6,3,8,3,5,7,9))
And I would like to add a "year.rank" column, which gives an item's rank within a given year, where a higher count leads to a higher "rank". With the above, it would look like:
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
I know I could do this for the whole data frame using order(df$count), but I'm not sure how I would do it by year.
There is a rank function to help you with that:
transform(df,
year.rank = ave(count, year,
FUN = function(x) rank(-x, ties.method = "first")))
item year count year.rank
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
data.table version for practice:
library(data.table)
DT <- as.data.table(df)
DT[,yrrank:=rank(-count,ties.method="first"),by=year]
item year count yrrank
1: a 2010 1 3
2: b 2010 4 2
3: c 2010 6 1
4: a 2011 3 2
5: b 2011 8 1
6: c 2011 3 3
7: a 2012 5 3
8: b 2012 7 2
9: c 2012 9 1
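Using the order function. Note that order() on its own returns sorting positions rather than ranks (the two only happen to coincide on this sample data), so it is applied twice:
transform(df, x = ave(count, year, FUN = function(x) order(order(x, decreasing = TRUE))))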
item year count x
1 a 2010 1 3
2 b 2010 4 2
3 c 2010 6 1
4 a 2011 3 2
5 b 2011 8 1
6 c 2011 3 3
7 a 2012 5 3
8 b 2012 7 2
9 c 2012 9 1
EDIT
You can use plyr here also:
library(plyr)
ddply(df, .(year), transform, x = order(order(count, decreasing = TRUE))) # order() twice turns sort positions into ranks
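A single call to order() is not a rank in general: it gives the positions of the elements in sorted order, which is the inverse permutation of the ranks. A small sketch with a vector where the two differ (the vector `x` is made up for illustration):

```r
x <- c(1, 6, 4)

# Positions of the largest, second largest, ... element
pos <- order(x, decreasing = TRUE)

# Rank of each element, with the largest value ranked 1
rnk <- rank(-x, ties.method = "first")

# Applying order() twice recovers the ranks
rnk2 <- order(order(x, decreasing = TRUE))

print(pos)   # 2 3 1
print(rnk)   # 3 1 2
print(rnk2)  # 3 1 2
```

On the question's sample data each per-year permutation happens to be its own inverse, which is why a single order() appears to give the right answer there.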
Using dplyr you could do it as follows:
library(dplyr) # 0.4.1
df %>%
group_by(year) %>%
mutate(yrrank = row_number(-count))
#Source: local data frame [9 x 4]
#Groups: year
#
# item year count yrrank
#1 a 2010 1 3
#2 b 2010 4 2
#3 c 2010 6 1
#4 a 2011 3 2
#5 b 2011 8 1
#6 c 2011 3 3
#7 a 2012 5 3
#8 b 2012 7 2
#9 c 2012 9 1
It is the same as:
df %>%
group_by(year) %>%
mutate(yrrank = rank(-count, ties.method = "first"))
Note that the resulting data is still grouped by "year". If you want to remove the grouping you can simply extend the pipe with %>% ungroup().
While using the answers given by others, I found that the following performs faster than the transform and dplyr variants:
df$year.rank <- ave(df$count, df$year, FUN = function(x) rank(-x, ties.method = "first"))
