R get row as list - r

I have a tibble data frame. When I run df[[1]] I get the first column as a vector, but how do I do the same for each row?
The data frame looks something like this:
# A tibble: 9 x 4
# Groups: year [9]
year value1 value2 value3
<dbl> <int> <int> <int>
1 2001 NA 3 4
2 2002 8 3 4
3 2003 4 3 NA
4 2004 NA NA 1
5 2005 9 NA 1
6 2006 1 NA NA
7 2007 NA 5 NA
8 2008 9 5 NA
9 2009 NA 5 NA
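One possible approach, sketched in base R. It assumes a plain data frame with the columns shown above (only the first three rows are reproduced), and the names first_row and row_lists are purely illustrative. For a grouped tibble it may be simpler to call dplyr::ungroup() first.

```r
# Small stand-in for the tibble in the question (first three rows only).
df <- data.frame(
  year   = c(2001, 2002, 2003),
  value1 = c(NA, 8L, 4L),
  value2 = c(3L, 3L, 3L),
  value3 = c(4L, 4L, NA)
)

# df[i, ] extracts row i as a one-row data frame;
# as.list() turns it into a named list with one element per column.
first_row <- as.list(df[1, ])

# The same for every row: a list of named lists.
row_lists <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))
```

Each element of row_lists can then be accessed by column name, for example row_lists[[2]]$value1.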

Related

Sum previous 3 and 5 observations by group, ID and date in R

I have a very large database that looks like this. For context, the data pertains to different companies with their related CEOs (ID) and the different years each CEO was in charge.
ID <- c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4)
C <- c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c')
fyear <- c(2000, 2001, 2002,2003,2004,2005,2000, 2001,2002,2003,2004,2000, 2001, 2002,2003,2004,2005,2006)
data <- c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
df1 <- data.frame(ID,C,fyear, data)
ID C fyear data
1 a 2000 30
1 a 2001 50
1 a 2002 22
1 a 2003 3
1 a 2004 6
1 a 2005 11
3 b 2000 5
3 b 2001 3
3 b 2002 7
5 b 2003 6
5 b 2004 9
4 c 2000 31
4 c 2001 5
4 c 2002 6
4 c 2003 7
4 c 2004 44
4 c 2005 33
4 c 2006 2
I need to write code that sums the previous 3 and 5 values of data for each ID in every year, i.e. t-3 and t-5 for each year. The result should look something like this:
ID C fyear data data3 data5
1 a 2000 30 NA NA
1 a 2001 50 NA NA
1 a 2002 22 102 NA
1 a 2003 3 75 NA
1 a 2004 6 31 111
1 a 2005 11 20 86
3 b 2000 5 NA NA
3 b 2001 3 NA NA
3 b 2002 7 15 NA
5 b 2003 6 NA NA
5 b 2004 9 NA NA
4 c 2000 31 NA NA
4 c 2001 5 NA NA
4 c 2002 6 42 NA
4 c 2003 7 18 NA
4 c 2004 44 57 93
4 c 2005 33 84 95
4 c 2006 2 79 92
I have different columns of data for which I need to perform this operation, so if somebody also knows how I can do that and create a data3 and data5 column also for the other columns of data that I have that would be amazing. But even just being able to do the summation that I need is great! Thanks a lot.
I have looked around but can't seem to find any similar cases that satisfy my need.
We can use rollsumr to perform the rolling sums.
library(dplyr, exclude = c("filter", "lag"))
library(zoo)
df1 %>%
group_by(ID, C) %>%
mutate(data3 = rollsumr(data, 3, fill = NA),
data5 = rollsumr(data, 5, fill = NA)) %>%
ungroup
## # A tibble: 18 x 6
## ID C fyear data data3 data5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 1 a 2000 30 NA NA
## 2 1 a 2001 50 NA NA
## 3 1 a 2002 22 102 NA
## 4 1 a 2003 3 75 NA
## 5 1 a 2004 6 31 111
...snip...
To apply that to multiple columns, e.g. to apply it to fyear and to data use across:
df1 %>%
group_by(ID, C) %>%
mutate(across(c("fyear", "data"),
list(`3` = ~ rollsumr(., 3, fill = NA),
`5` = ~ rollsumr(., 5, fill = NA)),
.names = "{.col}{.fn}")) %>%
ungroup
## # A tibble: 18 x 8
## ID C fyear data fyear3 fyear5 data3 data5
## <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 a 2000 30 NA NA NA NA
## 2 1 a 2001 50 NA NA NA NA
## 3 1 a 2002 22 6003 NA 102 NA
## 4 1 a 2003 3 6006 NA 75 NA
## 5 1 a 2004 6 6009 10010 31 111
...snip...
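For readers who want to see what a right-aligned rolling sum computes, here is a minimal base R sketch. The helper roll_sum_right() is hypothetical (it is not part of zoo) and assumes no missing values in the input; grouping by ID and C is done with ave().

```r
# Sample data from the question.
df1 <- data.frame(
  ID    = c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4),
  C     = c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c'),
  fyear = c(2000,2001,2002,2003,2004,2005,2000,2001,2002,2003,2004,
            2000,2001,2002,2003,2004,2005,2006),
  data  = c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
)

# Hypothetical helper: sum of the current value and the k - 1 values before it,
# with NA wherever fewer than k values are available.
roll_sum_right <- function(x, k) {
  n <- length(x)
  if (n < k) return(rep(NA_real_, n))
  cs <- cumsum(c(0, x))  # cs[i + 1] holds the sum of x[1..i]
  c(rep(NA_real_, k - 1), cs[(k + 1):(n + 1)] - cs[1:(n - k + 1)])
}

# Apply the helper within each (ID, C) group.
df1$data3 <- ave(df1$data, df1$ID, df1$C, FUN = function(v) roll_sum_right(v, 3))
df1$data5 <- ave(df1$data, df1$ID, df1$C, FUN = function(v) roll_sum_right(v, 5))
```

The cumulative-sum difference avoids an explicit loop; groups shorter than the window simply come back as NA.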
We can use frollsum within data.table
library(data.table)
d <- 2:5
setDT(df1)[
,
c(paste0("data", d)) := lapply(d, frollsum, x = data),
.(ID, C)
]
which yields
> df1
ID C fyear data data2 data3 data4 data5
1: 1 a 2000 30 NA NA NA NA
2: 1 a 2001 50 80 NA NA NA
3: 1 a 2002 22 72 102 NA NA
4: 1 a 2003 3 25 75 105 NA
5: 1 a 2004 6 9 31 81 111
6: 1 a 2005 11 17 20 42 92
7: 3 b 2000 5 NA NA NA NA
8: 3 b 2001 3 8 NA NA NA
9: 3 b 2002 7 10 15 NA NA
10: 5 b 2003 6 NA NA NA NA
11: 5 b 2004 9 15 NA NA NA
12: 4 c 2000 31 NA NA NA NA
13: 4 c 2001 5 36 NA NA NA
14: 4 c 2002 6 11 42 NA NA
15: 4 c 2003 7 13 18 49 NA
16: 4 c 2004 44 51 57 62 93
17: 4 c 2005 33 77 84 90 95
18: 4 c 2006 2 35 79 86 92
To solve your specific question, this is a tidyverse solution:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
mutate(
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]]),
fyear5=rowSums(list(sapply(1:5, function(x) lag(data, x)))[[1]])
) %>%
ungroup()
# A tibble: 18 × 6
ID C fyear data fyear3 fyear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA NA
8 3 b 2001 3 NA NA
9 3 b 2002 7 NA NA
10 5 b 2003 6 NA NA
11 5 b 2004 9 NA NA
12 4 c 2000 31 NA NA
13 4 c 2001 5 NA NA
14 4 c 2002 6 NA NA
15 4 c 2003 7 42 NA
16 4 c 2004 44 18 NA
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
The first mutate is a little hairy, so let's break one of the assignments down...
Find the nth lagged values of the data column, for n=1, 2 and 3.
sapply(1:3, function(x) lag(data, x))
Changes in CEO and Company are handled by the group_by() earlier in the pipe.
Create a list of these lagged values.
list(sapply(1:3, function(x) lag(data, x)))[[1]]
Row by row, calculate the sums of the lagged values
fyear3=rowSums(list(sapply(1:3, function(x) lag(data, x)))[[1]])
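To make the breakdown concrete, here is a small base R sketch of the same steps on the data values of the first CEO. The helper lag_vec() is a hypothetical stand-in for dplyr's lag() so that the example runs without any packages.

```r
# data values for ID 1, company a
x <- c(30, 50, 22, 3, 6, 11)

# Hypothetical stand-in for dplyr::lag(): shift x down by n, padding with NA.
lag_vec <- function(x, n) c(rep(NA, n), x)[seq_along(x)]

# One column per lag (1, 2 and 3); sapply() returns a 6 x 3 matrix.
lagged <- sapply(1:3, function(n) lag_vec(x, n))

# Row by row, sum the three lagged values. Any NA in a row gives NA.
lag_sum3 <- rowSums(lagged)
lag_sum3
```

Because sapply() already returns a matrix here, wrapping it in list(...)[[1]] does not change the result.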
Now generalise the problem. Write a function that takes as its inputs a dataset (so it works in a pipe), the new column, the column containing the values for which a lagged sum is required, and an integer defining the maximum lag.
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
ungroup()
}
The embracing ({{ and }}) and the use of := are required to handle tidyverse's non-standard evaluation (NSE).
Now use the function.
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 3) %>%
lagSum(sumFYear5, data, 5)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 NA NA
4 1 a 2003 3 102 NA
5 1 a 2004 6 75 NA
6 1 a 2005 11 31 111
7 3 b 2000 5 NA 92
8 3 b 2001 3 NA 47
9 3 b 2002 7 NA 28
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 32
12 4 c 2000 31 NA 30
13 4 c 2001 5 NA 56
14 4 c 2002 6 NA 58
15 4 c 2003 7 42 57
16 4 c 2004 44 18 58
17 4 c 2005 33 57 93
18 4 c 2006 2 84 95
EDIT
I misunderstood what you meant by "lag" and didn't read your description properly. My apologies.
I think the 86 in row 6 of your data5 column should be 92. If not, please explain why not.
Getting the answers you want should be a simple matter of adapting the function I wrote. For example:
lagSum <- function(data, newCol, valueCol, maxLag) {
data %>%
mutate(
{{newCol}} := {{valueCol}} + rowSums(
list(
sapply(
1:maxLag,
function(x) lag({{valueCol}}, x)
)
)[[1]]
)
) %>%
ungroup()
}
Gives
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2)
# A tibble: 18 × 5
ID C fyear data sumFYear3
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 102
4 1 a 2003 3 75
5 1 a 2004 6 31
6 1 a 2005 11 20
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 15
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 42
15 4 c 2003 7 18
16 4 c 2004 44 57
17 4 c 2005 33 84
18 4 c 2006 2 79
and
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, data, 4)
# A tibble: 18 × 5
ID C fyear data sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA
2 1 a 2001 50 NA
3 1 a 2002 22 NA
4 1 a 2003 3 NA
5 1 a 2004 6 111
6 1 a 2005 11 92
7 3 b 2000 5 NA
8 3 b 2001 3 NA
9 3 b 2002 7 NA
10 5 b 2003 6 NA
11 5 b 2004 9 NA
12 4 c 2000 31 NA
13 4 c 2001 5 NA
14 4 c 2002 6 NA
15 4 c 2003 7 NA
16 4 c 2004 44 93
17 4 c 2005 33 95
18 4 c 2006 2 92
as expected, but
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2) %>%
lagSum(sumFYear5, data, 4)
# A tibble: 18 × 6
ID C fyear data sumFYear3 sumFYear5
<dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 1 a 2000 30 NA NA
2 1 a 2001 50 NA NA
3 1 a 2002 22 102 NA
4 1 a 2003 3 75 NA
5 1 a 2004 6 31 111
6 1 a 2005 11 20 92
7 3 b 2000 5 NA 47
8 3 b 2001 3 NA 28
9 3 b 2002 7 15 32
10 5 b 2003 6 NA 32
11 5 b 2004 9 NA 30
12 4 c 2000 31 NA 56
13 4 c 2001 5 NA 58
14 4 c 2002 6 42 57
15 4 c 2003 7 18 58
16 4 c 2004 44 57 93
17 4 c 2005 33 84 95
18 4 c 2006 2 79 92
Not as expected. At the moment, I cannot explain why. I managed to get the correct answers for both 3 and 5 year lags in the same pipe with:
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear3, data, 2) %>%
left_join(
df1 %>%
arrange(C, ID, fyear) %>%
group_by(C, ID) %>%
lagSum(sumFYear5, data, 4)
)
But that shouldn't be necessary. I will think about this some more and may post a question of my own if I can't find an explanation.
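A likely explanation for the unexpected sumFYear5 values: lagSum() ends with ungroup(), so the second call in the pipe receives ungrouped data and its lags run across company and CEO boundaries. For instance, the 47 in row 7 is 5 + 11 + 6 + 3 + 22, the row's own value plus the four preceding rows, which belong to a different CEO. A minimal sketch of a variant that leaves the grouping in place follows. The name lagSumGrouped is hypothetical, and the body is otherwise the function from the EDIT.

```r
library(dplyr)

df1 <- data.frame(
  ID    = c(1,1,1,1,1,1,3,3,3,5,5,4,4,4,4,4,4,4),
  C     = c('a','a','a','a','a','a','b','b','b','b','b','c','c','c','c','c','c','c'),
  fyear = c(2000,2001,2002,2003,2004,2005,2000,2001,2002,2003,2004,
            2000,2001,2002,2003,2004,2005,2006),
  data  = c(30,50,22,3,6,11,5,3,7,6,9,31,5,6,7,44,33,2)
)

# Same calculation as lagSum(), but without ungroup(), so the grouping
# set up by group_by() survives for the next call in the pipe.
# Note: rowSums() needs a matrix, so this assumes every group has at least two rows.
lagSumGrouped <- function(data, newCol, valueCol, maxLag) {
  data %>%
    mutate(
      {{ newCol }} := {{ valueCol }} +
        rowSums(sapply(1:maxLag, function(x) lag({{ valueCol }}, x)))
    )
}

result <- df1 %>%
  arrange(C, ID, fyear) %>%
  group_by(C, ID) %>%
  lagSumGrouped(sumFYear3, data, 2) %>%
  lagSumGrouped(sumFYear5, data, 4) %>%
  ungroup()  # drop the grouping once, at the end of the pipe
```

With the grouping preserved, both columns can be computed in one pipe without the left_join().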
Alternatively, this question gives a solution using the zoo package.

Match and re-order rows in multiple columns in R (tidyverse)

I have a dataset like this (in the actual dataset, I have more columns like subj01):
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 6
3 3 5 5 9
4 4 9 6 NA
5 5 10 8 NA
6 6 NA 9 NA
7 7 NA 10 NA
8 8 NA NA NA
9 9 NA NA NA
10 10 NA NA NA
I created the dataset using the code below.
data = tibble(item = 1:10, subj01 = c(1,2,5,9,10,NA,NA,NA,NA,NA), subj02 = c(1,2,5,6,8,9,10,NA,NA,NA), subj03 = c(1,6,9,NA,NA,NA,NA,NA,NA,NA))
I would like to reorder all the columns beginning with "subj" so that the position of the values match that in the item column.
That is, for this example dataset, I would like to end up with this:
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
I've figured that I can match and re-order one column by running this:
data$subj01[match(data$item,data$subj01)]
[1] 1 2 NA NA 5 NA NA NA 9 10
But I am struggling to apply this across multiple columns (ideally I'd like to embed the command in a dplyr pipe).
I tried the command below, but this gave me an error "Error in mutate(x. = x.[match(item, x.)]) : object 'x.' not found".
data = data %>% across(mutate(x.=x.[match(item,x.)]))
I'd appreciate any suggestions! Thank you.
library(tidyverse)
data %>%
pivot_longer(-item) %>%
filter(!is.na(value)) %>%
mutate(item = value) %>%
complete(item = 1:10, name) %>%
pivot_wider(names_from = name, values_from = value)
# A tibble: 10 × 4
item subj01 subj02 subj03
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
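The match() idea from the question can also be applied to every subj column directly. The following is a minimal base R sketch; it uses a plain data frame named dat (a hypothetical name, chosen to avoid masking the data() function) instead of the tibble.

```r
dat <- data.frame(
  item   = 1:10,
  subj01 = c(1, 2, 5, 9, 10, NA, NA, NA, NA, NA),
  subj02 = c(1, 2, 5, 6, 8, 9, 10, NA, NA, NA),
  subj03 = c(1, 6, 9, NA, NA, NA, NA, NA, NA, NA)
)

# Columns whose names start with "subj".
subj_cols <- grep("^subj", names(dat), value = TRUE)

# For each such column, put every value at the position of the matching item;
# items with no match in the column become NA.
dat[subj_cols] <- lapply(dat[subj_cols], function(col) col[match(dat$item, col)])
```

Inside a dplyr pipe, the same function body should be usable in mutate(across(starts_with("subj"), ...)), which is probably what the attempted command was aiming for.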

Parsing lags in a time series

I'm making a table of lagged columns for time series data, but I'm having trouble reshaping the data.
My original data.table looks like this:
A data table sorted by descending years column and a doy column with one value, the n_a column starts from 9 and descends to 4
And I want to make it look like this:
Lagged variable time series table where each column starts with the row after the prev
Assuming your data frame is called df, you can do:
df[4:7] <- lapply(1:4, function(x) dplyr::lead(df$n_a, x))
names(df)[4:7] <- paste0('n_a_lag', 1:4)
df
#> doy year n_a n_a_lag1 n_a_lag2 n_a_lag3 n_a_lag4
#> 1 1 2022 9 8 7 6 5
#> 2 1 2021 8 7 6 5 4
#> 3 1 2020 7 6 5 4 NA
#> 4 1 2019 6 5 4 NA NA
#> 5 1 2018 5 4 NA NA NA
#> 6 1 2017 4 NA NA NA NA
Data taken from image in question, in reproducible format
df <- data.frame(doy = 1, year = 2022:2017, n_a = 9:4)
df
#> doy year n_a
#> 1 1 2022 9
#> 2 1 2021 8
#> 3 1 2020 7
#> 4 1 2019 6
#> 5 1 2018 5
#> 6 1 2017 4
Created on 2022-07-21 by the reprex package (v2.0.1)
You can use data.table::shift for multiple leads and lags:
d[,(paste0("n_a_lag",1:4)):= shift(n_a,1:4,type = "lead")]
Output:
doy year n_a n_a_lag1 n_a_lag2 n_a_lag3 n_a_lag4
<num> <int> <int> <int> <int> <int> <int>
1: 1 2022 9 8 7 6 5
2: 1 2021 8 7 6 5 4
3: 1 2020 7 6 5 4 NA
4: 1 2019 6 5 4 NA NA
5: 1 2018 5 4 NA NA NA
6: 1 2017 4 NA NA NA NA
Input:
d = data.table(
doy =c(1,1,1,1,1,1),
year = 2022:2017,
n_a=9:4
)

Remove NAs in each column by group

I have a dataframe with rows grouped by Year. Not every variable has observations in every year, but when one does, it has exactly 3 observations in that year, and they can appear in different rows from the other variables' observations.
> na_data
Year Peter Paul John
1 2011 1 NA NA
2 2011 2 NA NA
3 2011 3 NA NA
4 2011 NA 1 NA
5 2011 NA 2 NA
6 2011 NA 3 NA
7 2012 1 NA NA
8 2012 NA 3 NA
9 2012 2 NA NA
10 2012 NA 2 NA
11 2012 3 NA NA
12 2012 NA 1 NA
13 2013 NA 1 4
14 2013 NA 2 5
15 2013 NA 3 6
16 2013 1 NA NA
17 2013 2 NA NA
18 2013 3 NA NA
I want to remove the NAs in each column by group, so that the output looks like this:
final_data
Year Peter Paul John
[1,] 2011 1 1 NA
[2,] 2011 2 2 NA
[3,] 2011 3 3 NA
[4,] 2012 1 3 NA
[5,] 2012 2 2 NA
[6,] 2012 3 1 NA
[7,] 2013 1 1 4
[8,] 2013 2 2 5
[9,] 2013 3 3 6
So far I have used a loop, but I am looking for a cleaner solution. Any help would be great. My solution:
cleaned_list <- vector("list", length(unique(na_data$Year)))
names(cleaned_list) <- unique(na_data$Year)
for(yr in unique(na_data$Year)) {
temp <- matrix(NA, nrow = 3, ncol = ncol(na_data),
dimnames = list(NULL, colnames(na_data)))
for(name in colnames(na_data)[-1]){
no_nas <- as.vector(na.omit(na_data[na_data$Year==yr, name]))
if (length(no_nas)!=0) temp[,name] <- no_nas
}
temp[,1] <- yr
cleaned_list[[as.character(yr)]] <- temp
}
final_data <- do.call("rbind", cleaned_list)
Data:
na_data <- data.frame(
Year = rep(c(2011,2012,2013), each = 6),
Peter = c(1:3, rep(NA, 3), 1,NA,2,NA,3,NA, rep(NA, 3),1:3),
Paul = c(rep(NA,3), 1:3, NA,3,NA,2,NA, 1, 1:3, rep(NA,3)),
John = c(rep(NA, 12), 4:6, rep(NA, 3))
)
desired <- data.frame(
Year = rep(c(2011,2012,2013), each = 3),
Peter = c(1:3, 1:3, 1:3),
Paul = c( 1:3, 3:1, 1:3),
John = c(rep(NA, 6), 4:6)
) # same as final_data but a dataframe
Here is one possible solution using data.table package:
library(data.table)
setDT(na_data)[, lapply(.SD, function(x) if(length(y<-na.omit(x))) y else first(x)), by=Year]
# Year Peter Paul John
# 1: 2011 1 1 NA
# 2: 2011 2 2 NA
# 3: 2011 3 3 NA
# 4: 2012 1 3 NA
# 5: 2012 2 2 NA
# 6: 2012 3 1 NA
# 7: 2013 1 1 4
# 8: 2013 2 2 5
# 9: 2013 3 3 6
dplyr equivalent:
library(dplyr)
na_data |>
group_by(Year) |>
summarise(across(.fns = ~ if(length(y<-na.omit(.x))) y else first(.x)))
# # A tibble: 9 x 4
# # Groups: Year [3]
# Year Peter Paul John
# <dbl> <dbl> <dbl> <int>
# 1 2011 1 1 NA
# 2 2011 2 2 NA
# 3 2011 3 3 NA
# 4 2012 1 3 NA
# 5 2012 2 2 NA
# 6 2012 3 1 NA
# 7 2013 1 1 4
# 8 2013 2 2 5
# 9 2013 3 3 6
Convert to long form, remove the NA's, add a sequence number n, convert back and remove n.
library(dplyr)
library(tidyr)
na_data %>%
pivot_longer(-Year) %>%
drop_na %>%
group_by(Year, name) %>%
mutate(n = 1:n()) %>%
ungroup %>%
pivot_wider %>%
select(-n)
giving:
# A tibble: 9 x 4
Year Paul Peter John
<dbl> <dbl> <dbl> <dbl>
1 2011 1 1 NA
2 2011 2 2 NA
3 2011 3 3 NA
4 2012 1 1 NA
5 2012 2 2 NA
6 2012 3 3 NA
7 2013 1 1 4
8 2013 2 2 5
9 2013 3 3 6
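For comparison, the same idea can be sketched in base R without a loop over columns. This assumes, as stated in the question, that a variable has either no observations or exactly 3 observations in a year; the helper compact_year() is a hypothetical name.

```r
na_data <- data.frame(
  Year  = rep(c(2011, 2012, 2013), each = 6),
  Peter = c(1:3, rep(NA, 3), 1, NA, 2, NA, 3, NA, rep(NA, 3), 1:3),
  Paul  = c(rep(NA, 3), 1:3, NA, 3, NA, 2, NA, 1, 1:3, rep(NA, 3)),
  John  = c(rep(NA, 12), 4:6, rep(NA, 3))
)

# For one year's rows: drop the NAs in every column except Year.
# A column that is entirely NA is reduced to a single NA, which
# data.frame() recycles to the length of the other columns.
compact_year <- function(g) {
  cols <- lapply(g[-1], function(col) {
    kept <- col[!is.na(col)]
    if (length(kept) > 0) kept else col[1]
  })
  data.frame(Year = g$Year[1], cols)
}

# Split by year, compact each piece, and stack the pieces again.
final_data <- do.call(rbind, lapply(split(na_data, na_data$Year), compact_year))
rownames(final_data) <- NULL
```

Values keep their order of appearance within each year, so Paul reads 3, 2, 1 for 2012.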

R: How can I group rows in a dataframe, ID rows meeting a condition, then delete prior rows for the group?

I have a dataframe of customers (identified by ID number), the number of units of two products they bought in each of four years, and a final column identifying the year in which new customers first purchased (the 'key' column). The problem: the dataframe includes rows from the years prior to new customers purchasing for the first time. I need to delete these rows. For example, this dataframe:
customer year item.A item.B key
1 1 2000 NA NA <NA>
2 1 2001 NA NA <NA>
3 1 2002 1 5 new.customer
4 1 2003 2 6 <NA>
5 2 2000 NA NA <NA>
6 2 2001 NA NA <NA>
7 2 2002 NA NA <NA>
8 2 2003 2 7 new.customer
9 3 2000 2 4 <NA>
10 3 2001 6 4 <NA>
11 3 2002 2 5 <NA>
12 3 2003 1 8 <NA>
needs to look like this:
customer year item.A item.B key
1 1 2002 1 5 new.customer
2 1 2003 2 6 <NA>
3 2 2003 2 7 new.customer
4 3 2000 2 4 <NA>
5 3 2001 6 4 <NA>
6 3 2002 2 5 <NA>
7 3 2003 1 8 <NA>
I thought I could do this using dplyr/tidyr - a combination of group, lead/lag, and slice (or perhaps filter and drop_na) but I can't figure out how to delete backwards in the customer group once I've identified the rows meeting the condition "key"=="new.customer". Thanks for any suggestions (code for the full dataframe below).
a<-c(1,1,1,1,2,2,2,2,3,3,3,3)
b<-c(2000,2001,2002,2003,2000,2001,2002,2003,2000,2001,2002,2003)
c<-c(NA,NA,1,2,NA,NA,NA,2,2,6,2,1)
d<-c(NA,NA,5,6,NA,NA,NA,7,4,4,5,8)
e<-c(NA,NA,"new",NA,NA,NA,NA,"new",NA,NA,NA,NA)
df <- data.frame("customer" =a, "year" = b, "C" = c, "D" = d,"key"=e)
df
As a first step I am marking existing customers (customer 3 in this case) in the key column -
df %>%
group_by(customer) %>%
mutate(
key = as.character(key), # can be avoided if key is a character to begin with
key = ifelse(row_number() == 1 & (!is.na(C) | !is.na(D)), "existing", key)
) %>%
filter(cumsum(!is.na(key)) > 0) %>%
ungroup()
# A tibble: 7 x 5
customer year C D key
<dbl> <dbl> <dbl> <dbl> <chr>
1 1 2002 1 5 new
2 1 2003 2 6 NA
3 2 2003 2 7 new
4 3 2000 2 4 existing
5 3 2001 6 4 NA
6 3 2002 2 5 NA
7 3 2003 1 8 NA
