Lags with nonconsecutive index - r

I have data with a missing index, for example:
df <- data.frame(year = c(2000:2004, 2006), value = c(0:4,6) ^ 2)
# year value
# 1 2000 0
# 2 2001 1
# 3 2002 4
# 4 2003 9
# 5 2004 16
# 6 2006 36
I would like to compute the lagged value for each year. If I use the lag function,
library(dplyr)
wrong <- mutate(df, prev = lag(value, order_by = year))
# year value prev
# 1 2000 0 NA
# 2 2001 1 0
# 3 2002 4 1
# 4 2003 9 4
# 5 2004 16 9
# 6 2006 36 16
it gives a lagged value for 2006 despite not having data on 2005. Can I get the previous year's value with the lag function?
Currently, I know I can do the following, but it's inefficient and messy:
right <- df %>% group_by(year) %>%
mutate(prev = ifelse(sum(df$year == year) == 1, df$value[df$year == year-1], NA))
# # A tibble: 6 x 3
# # Groups: year [6]
# year value prev
# <dbl> <dbl> <dbl>
# 1 2000 0 NA
# 2 2001 1.00 0
# 3 2002 4.00 1.00
# 4 2003 9.00 4.00
# 5 2004 16.0 9.00
# 6 2006 36.0 NA

Here's one simple approach:
mutate(df, prev = value[match(year - 1, year)])
# year value prev
# 1 2000 0 NA
# 2 2001 1 0
# 3 2002 4 1
# 4 2003 9 4
# 5 2004 16 9
# 6 2006 36 NA
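To see why this works, it can help to look at the intermediate index that match() produces. The base R sketch below rebuilds the example data; idx is a name introduced here purely for illustration.

```r
df <- data.frame(year = c(2000:2004, 2006), value = c(0:4, 6)^2)

# For each row, find the position of the previous year in the year column.
# match() returns NA when the previous year is absent (here 1999 and 2005).
idx <- match(df$year - 1, df$year)
idx
# [1] NA  1  2  3  4 NA

# Indexing with NA yields NA, so the gaps propagate automatically.
df$prev <- df$value[idx]
```

Note that match() returns the first matching position, so this assumes each year appears at most once.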

Related

Remove columns that are all NA for at least one level of a factor

I am hoping to tidy a dataframe by removing variables that are empty for any level of a grouping factor. It is fairly easy to remove columns that are entirely empty; however, there appears to be no simple way to apply this selection over groups.
## Data
site<-c("A","A","A","A","A","B","B","B","B","B")
year<-c("2000","2001","2002","2003","2004","2000","2001","2002","2003","2004")
species_A<-c(1,2,3,4,5,NA,NA,NA,NA,NA)
species_B<-c(1,2,NA,4,5,NA,3,4,5,6)
species_C<-c(1,2,3,4,5,2,3,4,5,6)
dat<-data.frame(site,year,species_A,species_B,species_C)
site year species_A species_B species_C
1 A 2000 1 1 1
2 A 2001 2 2 2
3 A 2002 3 NA 3
4 A 2003 4 4 4
5 A 2004 5 5 5
6 B 2000 NA NA 2
7 B 2001 NA 3 3
8 B 2002 NA 4 4
9 B 2003 NA 5 5
10 B 2004 NA 6 6
## Remove columns with any NAs
dat %>%
group_by(site) %>%
select(where( ~!any(is.na(.x))))
## which returns
site year species_C
<chr> <chr> <dbl>
1 A 2000 1
2 A 2001 2
3 A 2002 3
4 A 2003 4
5 A 2004 5
6 B 2000 2
7 B 2001 3
8 B 2002 4
9 B 2003 5
10 B 2004 6
## Alternatively, if I use all() in select(), it only drops columns that are entirely NA across the whole data frame.
dat %>%
group_by(site) %>%
select(where( ~!all(is.na(.x))))
## however I am trying to get...
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
It seems like this should be fairly straightforward but for whatever reason I cannot seem to get it to work.
Thanks!
Another option:
dat %>%
select(site, dat %>%
group_by(site) %>%
summarise(across(everything(), ~!all(is.na(.x))))%>%
ungroup() %>%
select(-site) %>%
select(where(all))%>%
names())
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
We can split by site, then use select(where(~!all(is.na(.x)))) to drop the all-NA columns from each data frame, and finally subset dat by the intersection of the remaining column names.
library(dplyr)
library(purrr)
dat %>% split(site) %>%
map(\(x) select(x, where(~!all(is.na(.x)))))%>%
map(names)%>%
reduce(intersect)%>%
dat[.]
Or, for a purrr-only solution:
library(purrr)
dat %>% split(site) %>%
map(~discard(., ~all(is.na(.x))))%>%
map(names)%>%
reduce(intersect)%>%
dat[.]
As an alternative, we can call summarise twice: once on grouped data to tell if any group is all-NAs, and a second call to obtain the final logical vector. Then subset dat with the logical vector:
library(dplyr)
dat %>% group_by(site) %>%
summarise(across(everything(), ~all(is.na(.x))))%>%
summarise(across(everything(), ~!(is.logical(.x) && any(.x))))%>%
unlist()%>%
dat[,.]
Or, using map_lgl() from purrr for the second step:
library(purrr)
dat %>% group_by(site) %>%
summarise(across(everything(), ~all(is.na(.x))))%>%
map_lgl(~!(is.logical(.x) && any(.x)))%>%
dat[,.]
output
site year species_B species_C
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6
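The same two-step logic (flag all-NA columns per site, then drop any column flagged in at least one site) can also be sketched in base R. This is only an illustration: species_cols, all_na and keep are names made up here, and the species columns are assumed to be everything except site and year.

```r
dat <- data.frame(
  site = rep(c("A", "B"), each = 5),
  year = rep(as.character(2000:2004), 2),
  species_A = c(1, 2, 3, 4, 5, NA, NA, NA, NA, NA),
  species_B = c(1, 2, NA, 4, 5, NA, 3, 4, 5, 6),
  species_C = c(1, 2, 3, 4, 5, 2, 3, 4, 5, 6)
)

# Assumption: every column other than site and year is a species column.
species_cols <- setdiff(names(dat), c("site", "year"))

# Step 1: one row per site, one column per species; TRUE = all NA in that site.
all_na <- sapply(dat[species_cols],
                 function(col) tapply(is.na(col), dat$site, all))

# Step 2: keep a species only if it is not all-NA in any site.
keep <- species_cols[colSums(all_na) == 0]
result <- dat[c("site", "year", keep)]
```

With the example data, keep contains species_B and species_C, so result has the same columns as the desired output.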
You could convert to long format, drop every species that is all NA within at least one site, then convert back to wide format.
library(tidyverse)
dat %>%
tidyr::pivot_longer(!c(site, year), names_to = "species", values_to = "values") %>%
dplyr::group_by(site, species) %>%
dplyr::mutate(allNA = all(is.na(values))) %>%
dplyr::ungroup(site) %>%
dplyr::filter(!any(allNA == TRUE)) %>%
dplyr::select(-allNA) %>%
tidyr::pivot_wider(names_from = "species", values_from = "values")
Output
# A tibble: 10 × 4
site year species_B species_C
<chr> <chr> <dbl> <dbl>
1 A 2000 1 1
2 A 2001 2 2
3 A 2002 NA 3
4 A 2003 4 4
5 A 2004 5 5
6 B 2000 NA 2
7 B 2001 3 3
8 B 2002 4 4
9 B 2003 5 5
10 B 2004 6 6

calculate the sum in a data.frame (long format)

I want to calculate the sum for this data.frame for the years 2005, 2006, 2007 and the categories a, b, c.
year <- c(2005,2005,2005,2006,2006,2006,2007,2007,2007)
category <- c("a","a","a","b","b","b","c","c","c")
value <- c(3,6,8,9,7,4,5,8,9)
df <- data.frame(year, category,value, stringsAsFactors = FALSE)
The table should look like this:
year category value
2005 a 1
2005 a 1
2005 a 1
2006 b 2
2006 b 2
2006 b 2
2007 c 3
2007 c 3
2007 c 3
2006 a 3
2007 b 6
2008 c 9
Any idea how this could be implemented?
add_row or cbind maybe?
How about this, using the dplyr package:
library(dplyr)
df %>%
group_by(year, category) %>%
summarise(sum = sum(value))
# # A tibble: 3 × 3
# # Groups: year [3]
# year category sum
# <dbl> <chr> <dbl>
# 1 2005 a 17
# 2 2006 b 20
# 3 2007 c 22
If you would rather add the sum as a new column than collapse the rows, replace summarise() with mutate():
df %>%
group_by(year, category) %>%
mutate(sum = sum(value))
# # A tibble: 9 × 4
# # Groups: year, category [3]
# year category value sum
# <dbl> <chr> <dbl> <dbl>
# 1 2005 a 3 17
# 2 2005 a 6 17
# 3 2005 a 8 17
# 4 2006 b 9 20
# 5 2006 b 7 20
# 6 2006 b 4 20
# 7 2007 c 5 22
# 8 2007 c 8 22
# 9 2007 c 9 22
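For comparison, base R can add the per-group sum as a column with ave(), which gives roughly the same result as the mutate() version. A minimal sketch using the question's data:

```r
df <- data.frame(
  year = c(2005, 2005, 2005, 2006, 2006, 2006, 2007, 2007, 2007),
  category = c("a", "a", "a", "b", "b", "b", "c", "c", "c"),
  value = c(3, 6, 8, 9, 7, 4, 5, 8, 9),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each year/category group and returns a vector
# as long as the input, so the group sum is repeated on every row.
df$sum <- ave(df$value, df$year, df$category, FUN = sum)
```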
A base R solution using aggregate(), appending the group sums as extra rows:
rbind( df, aggregate( value ~ year + category, df, sum ) )
year category value
1 2005 a 3
2 2005 a 6
3 2005 a 8
4 2006 b 9
5 2006 b 7
6 2006 b 4
7 2007 c 5
8 2007 c 8
9 2007 c 9
10 2005 a 17
11 2006 b 20
12 2007 c 22

Remove group that has NAs in only some rows

I need to remove years that do not have measurements for every day of the year. Pretend this is a full set and I want to get rid of all 2001 rows because 2001 has one missing measurement.
year day value
2000 1 5
2000 2 3
2000 3 2
2000 4 3
2001 1 2
2001 2 NA
2001 3 6
2001 4 5
Sorry I don't have code attempts, I can't wrap my head around it right now and it took me forever to get this far. Prefer something I can %>% in, as it's at the end of a long run.
Filtering based on presence of NA values:
df %>%
group_by(year) %>%
filter(!anyNA(value))
Alternative filter conditions (pick what you find most readable):
all(!is.na(value))
sum(is.na(value)) == 0
!any(is.na(value))
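A quick check on one complete and one incomplete group suggests that these conditions agree. The vectors below copy the 2000 and 2001 values from the question; keep_group is a hypothetical helper written only for this comparison.

```r
value_2000 <- c(5, 3, 2, 3)    # complete, so the group should be kept
value_2001 <- c(2, NA, 6, 5)   # contains one NA, so the group should be dropped

# Each condition is TRUE when a group has no missing values.
keep_group <- function(value) {
  c(
    not_anyNA  = !anyNA(value),
    all_not_na = all(!is.na(value)),
    sum_zero   = sum(is.na(value)) == 0,
    not_any    = !any(is.na(value))
  )
}

keep_group(value_2000)  # every element TRUE
keep_group(value_2001)  # every element FALSE
```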
Here's a one line solution using base R -
df %>% .[!ave(.$value, .$year, FUN = anyNA), ]
Example -
df <- data.frame(year = c(rep(2000, 4), rep(2001, 4)), day = 1:4, value = sample.int(10, 8))
df$value[6] <- NA_integer_
# year day value
# 1 2000 1 4
# 2 2000 2 3
# 3 2000 3 2
# 4 2000 4 7
# 5 2001 1 8
# 6 2001 2 NA
# 7 2001 3 1
# 8 2001 4 5
df %>% .[!ave(.$value, .$year, FUN = anyNA), ]
# year day value
# 1 2000 1 4
# 2 2000 2 3
# 3 2000 3 2
# 4 2000 4 7
In base R you could do:
subset(df,!year %in% year[is.na(value)])
# year day value
# 1 2000 1 8
# 2 2000 2 5
# 3 2000 3 4
# 4 2000 4 1

dplyr::first() to choose first non NA value

I am looking for a way to extract the first and last non-NA value from each group. I am using dplyr::first() and dplyr::last(), but I can't work out how to choose the first or last non-NA value.
library(dplyr)
set.seed(123)
d <- data.frame(
group = rep(1:3, each = 3),
year = rep(seq(2000,2002,1),3),
value = sample(1:9, replace = TRUE))
#Introduce NA values in first row of group 2 and last row of group 3
d %>%
mutate(
value = case_when(
group == 2 & year ==2000 ~ NA_integer_,
group == 3 & year ==2002 ~ NA_integer_,
TRUE ~ value))%>%
group_by(group) %>%
mutate(
first = dplyr::first(value),
last = dplyr::last(value))
RESULT (with issue)
# A tibble: 9 x 5
# Groups: group [3]
group year value first last
<int> <dbl> <int> <int> <int>
1 1 2000 3 3 4
2 1 2001 8 3 4
3 1 2002 4 3 4
4 2 2000 NA NA 1
5 2 2001 9 NA 1
6 2 2002 1 NA 1
7 3 2000 5 5 NA
8 3 2001 9 5 NA
9 3 2002 NA 5 NA
Can you help me make the values in the "first" column for group 2 equal 9, and the values in the "last" column for group 3 equal 9?
I would much prefer a tidyverse solution, if one exists.
Use na.omit, compare:
first(c(NA, 11, 22))
# [1] NA
first(na.omit(c(NA, 11, 22)))
# [1] 11
Using example data:
d %>%
mutate(
value = case_when(
group == 2 & year ==2000 ~ NA_integer_,
group == 3 & year ==2002 ~ NA_integer_,
TRUE ~ value))%>%
group_by(group) %>%
mutate(
first = dplyr::first(na.omit(value)),
last = dplyr::last(na.omit(value)))
# # A tibble: 9 x 5
# # Groups: group [3]
# group year value first last
# <int> <dbl> <int> <int> <int>
# 1 1 2000 3 3 4
# 2 1 2001 8 3 4
# 3 1 2002 4 3 4
# 4 2 2000 NA 9 1
# 5 2 2001 9 9 1
# 6 2 2002 1 9 1
# 7 3 2000 5 5 9
# 8 3 2001 9 5 9
# 9 3 2002 NA 5 9
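The same idea can be written without na.omit() by subsetting on !is.na(). The base R sketch below uses two helpers, first_non_na and last_non_na, which are hypothetical names introduced here, and it hard-codes the values from the output above rather than relying on sample(). Recent dplyr releases (1.1.0 and later, as far as I know) also accept first(value, na_rm = TRUE), which may be worth checking against your installed version.

```r
d <- data.frame(
  group = rep(1:3, each = 3),
  year  = rep(2000:2002, 3),
  value = c(3L, 8L, 4L, NA, 9L, 1L, 5L, 9L, NA)
)

# Return the first / last non-missing element (NA if everything is missing).
first_non_na <- function(x) x[!is.na(x)][1]
last_non_na  <- function(x) rev(x[!is.na(x)])[1]

# ave() recycles the single value returned per group across that group's rows.
d$first <- ave(d$value, d$group, FUN = first_non_na)
d$last  <- ave(d$value, d$group, FUN = last_non_na)
```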

apply lag or lead in increasing order for the dataframe

df1 <- read.csv("C:/Users/uni/DS-project/df1.csv")
df1
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
6 2000 1
7 2001 2
8 2002 3
9 2003 4
10 2004 5
11 2000 1
12 2001 2
13 2002 3
14 2003 4
15 2004 5
16 2000 1
17 2001 2
18 2002 3
19 2003 4
20 2004 5
I want to apply lead so that I get the output shown below.
We have a set of 5 observations (one per year), repeated n times. In the output, the first set is kept whole; from the second set we drop 2000 and its value; from the third set we drop 2000 and 2001 and their values; from the fourth set we drop 2000, 2001 and 2002 and their values; and so on.
The result should look like this.
Expected output:
year value
2000 1
2001 2
2002 3
2003 4
2004 5
2001 2
2002 3
2003 4
2004 5
2002 3
2003 4
2004 5
2003 4
2004 5
Please help.
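Just for fun, here is a vectorized solution using matrix subsetting:
# logical mask with one column per block of 5 rows
m <- matrix(rep(TRUE, nrow(df1)), 5)
# in column i, set the first i-1 rows to FALSE
m[upper.tri(m)] <- FALSE
df1[m,]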
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 7 2001 2
# 8 2002 3
# 9 2003 4
# 10 2004 5
# 13 2002 3
# 14 2003 4
# 15 2004 5
# 19 2003 4
# 20 2004 5
Below grp is 1 for each row of the first group, 2 for the second and so on. Seq is 1, 2, 3, ... for the successive rows of each grp. Now just pick out those rows for which Seq is at least as large as grp. This has the effect of removing the first i-1 rows from the ith group for i = 1, 2, ... .
grp <- cumsum(df1$year == 2000)
Seq <- ave(grp, grp, FUN = seq_along)
subset(df1, Seq >= grp)
We could alternately write this in the less general form:
subset(df1, 1:5 >= rep(1:4, each = 5))
In any case the output from either subset statement is:
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
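To make the grp/Seq logic concrete, the sketch below rebuilds df1 in code (instead of reading the CSV, whose contents are assumed to match the printed data) and shows the intermediate vectors. The name res is introduced here for illustration.

```r
# Stand-in for the CSV: four repeats of the years 2000-2004.
df1 <- data.frame(year = rep(2000:2004, 4), value = rep(1:5, 4))

grp <- cumsum(df1$year == 2000)        # 1 1 1 1 1 2 2 2 2 2 3 ... 4
Seq <- ave(grp, grp, FUN = seq_along)  # 1 2 3 4 5 1 2 3 4 5 1 ... 5

# Keep a row only when its position within the block is >= the block number.
res <- subset(df1, Seq >= grp)
rownames(res)
# [1] "1"  "2"  "3"  "4"  "5"  "7"  "8"  "9"  "10" "13" "14" "15" "19" "20"
```

Because grp is derived from where 2000 occurs, this does not depend on the number of blocks, only on each block starting with 2000.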
Using dplyr:
library(dplyr)
df1 %>%
group_by(g = cumsum(year == 2000)) %>%
filter(row_number() >= g) %>%
ungroup %>%
select(-g)
# # A tibble: 14 x 2
# year value
# <int> <int>
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 6 2001 2
# 7 2002 3
# 8 2003 4
# 9 2004 5
# 10 2002 3
# 11 2003 4
# 12 2004 5
# 13 2003 4
# 14 2004 5
Using lapply():
to <- nrow(df1) / 5 - 1
df1[-unlist(lapply(1:to, function(x) seq_len(x) + 5*x)), ]
year value
1 2000 1
2 2001 2
3 2002 3
4 2003 4
5 2004 5
7 2001 2
8 2002 3
9 2003 4
10 2004 5
13 2002 3
14 2003 4
15 2004 5
19 2003 4
20 2004 5
Where unlist(lapply(1:to, function(x) seq_len(x) + 5*x)) gives the indices to skip:
[1] 6 11 12 16 17 18
Using sequence():
df1[5-rev(sequence(2:5)-1),]
# year value
# 1 2000 1
# 2 2001 2
# 3 2002 3
# 4 2003 4
# 5 2004 5
# 2.1 2001 2
# 3.1 2002 3
# 4.1 2003 4
# 5.1 2004 5
# 3.2 2002 3
# 4.2 2003 4
# 5.2 2004 5
# 4.3 2003 4
# 5.3 2004 5
How it works:
5-rev(sequence(2:5)-1)
# [1] 1 2 3 4 5 2 3 4 5 3 4 5 4 5
rev(sequence(2:5)-1)
# [1] 4 3 2 1 0 3 2 1 0 2 1 0 1 0
sequence(2:5)-1
# [1] 0 1 0 1 2 0 1 2 3 0 1 2 3 4
sequence(2:5)
# [1] 1 2 1 2 3 1 2 3 4 1 2 3 4 5
