Parsing lags in a time series - r

I'm making a table of lagged columns for time series data, but I'm having trouble reshaping the data.
My original data.table looks like this:
[Image from the question: a data table with a descending year column, a constant doy column, and an n_a column running from 9 down to 4]
And I want to make it look like this:
[Image from the question: the desired table of lag columns, where each successive column is n_a shifted up by one more row]

Assuming your data frame is called df, you can do:
df[4:7] <- lapply(1:4, function(x) dplyr::lead(df$n_a, x))
names(df)[4:7] <- paste0('n_a_lag', 1:4)
df
#> doy year n_a n_a_lag1 n_a_lag2 n_a_lag3 n_a_lag4
#> 1 1 2022 9 8 7 6 5
#> 2 1 2021 8 7 6 5 4
#> 3 1 2020 7 6 5 4 NA
#> 4 1 2019 6 5 4 NA NA
#> 5 1 2018 5 4 NA NA NA
#> 6 1 2017 4 NA NA NA NA
Data taken from the image in the question, in reproducible format:
df <- data.frame(doy = 1, year = 2022:2017, n_a = 9:4)
df
#> doy year n_a
#> 1 1 2022 9
#> 2 1 2021 8
#> 3 1 2020 7
#> 4 1 2019 6
#> 5 1 2018 5
#> 6 1 2017 4
Created on 2022-07-21 by the reprex package (v2.0.1)
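
If you prefer to keep the whole step inside a dplyr pipe, mutate() with across() and a named list of lead functions produces the same columns. A minimal sketch, assuming the reproducible df defined just above (the default naming pattern "{.col}_{.fn}" yields n_a_lag1 to n_a_lag4):
library(dplyr)
df %>%
  mutate(across(n_a, list(lag1 = ~ lead(.x, 1),
                          lag2 = ~ lead(.x, 2),
                          lag3 = ~ lead(.x, 3),
                          lag4 = ~ lead(.x, 4))))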

You can use data.table::shift for multiple leads and lags:
d[, (paste0("n_a_lag", 1:4)) := shift(n_a, 1:4, type = "lead")]
Output:
doy year n_a n_a_lag1 n_a_lag2 n_a_lag3 n_a_lag4
<num> <int> <int> <int> <int> <int> <int>
1: 1 2022 9 8 7 6 5
2: 1 2021 8 7 6 5 4
3: 1 2020 7 6 5 4 NA
4: 1 2019 6 5 4 NA NA
5: 1 2018 5 4 NA NA NA
6: 1 2017 4 NA NA NA NA
Input:
library(data.table)
d <- data.table(
  doy  = c(1, 1, 1, 1, 1, 1),
  year = 2022:2017,
  n_a  = 9:4
)
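
The same call also works if you start from the data.frame version. A sketch, assuming the fresh df from the reprex above (before the lag columns are added), converted in place with setDT():
library(data.table)
setDT(df)  # convert the data.frame to a data.table by reference
df[, (paste0("n_a_lag", 1:4)) := shift(n_a, 1:4, type = "lead")]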

Related

Match and re-order rows in multiple columns in R (tidyverse)

I have a dataset like this (in the actual dataset, I have more columns like subj01):
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 6
3 3 5 5 9
4 4 9 6 NA
5 5 10 8 NA
6 6 NA 9 NA
7 7 NA 10 NA
8 8 NA NA NA
9 9 NA NA NA
10 10 NA NA NA
I created the dataset using the code below.
data = tibble(item = 1:10, subj01 = c(1,2,5,9,10,NA,NA,NA,NA,NA), subj02 = c(1,2,5,6,8,9,10,NA,NA,NA), subj03 = c(1,6,9,NA,NA,NA,NA,NA,NA,NA))
I would like to reorder all the columns beginning with "subj" so that the position of the values match that in the item column.
That is, for this example dataset, I would like to end up with this:
# A tibble: 10 x 4
item subj01 subj02 subj03
<int> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
I've figured out that I can match and re-order one column by running this:
data$subj01[match(data$item,data$subj01)]
[1] 1 2 NA NA 5 NA NA NA 9 10
But I am struggling to apply this across multiple columns (ideally I'd like to embed the command in a dplyr pipe).
I tried the command below, but this gave me an error "Error in mutate(x. = x.[match(item, x.)]) : object 'x.' not found".
data = data %>% across(mutate(x.=x.[match(item,x.)]))
I'd appreciate any suggestions! Thank you.
library(tidyverse)
data %>%
  pivot_longer(-item) %>%
  filter(!is.na(value)) %>%
  mutate(item = value) %>%
  complete(item = 1:10, name) %>%
  pivot_wider(names_from = name, values_from = value)
# A tibble: 10 × 4
item subj01 subj02 subj03
<dbl> <dbl> <dbl> <dbl>
1 1 1 1 1
2 2 2 2 NA
3 3 NA NA NA
4 4 NA NA NA
5 5 5 5 NA
6 6 NA 6 6
7 7 NA NA NA
8 8 NA 8 NA
9 9 9 9 9
10 10 10 10 NA
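
Alternatively, the match() idea from the question can be applied to every subj column at once with across(). A sketch, assuming the data tibble defined above:
library(dplyr)
data %>%
  mutate(across(starts_with("subj"), ~ .x[match(item, .x)]))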

R get row as list

I have a tibble data frame. When I run df[[1]] I can get the first column as a list, but I'm wondering how I can do the same for each row.
The data frame looks something like this:
# A tibble: 9 x 4
# Groups: year [9]
year value1 value2 value3
<dbl> <int> <int> <int>
1 2001 NA 3 4
2 2002 8 3 4
3 2003 4 3 NA
4 2004 NA NA 1
5 2005 9 NA 1
6 2006 1 NA NA
7 2007 NA 5 NA
8 2008 9 5 NA
9 2009 NA 5 NA
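
A minimal sketch of one way to do this, assuming the tibble is called df: as.list() on a one-row subset returns that row as a named list, and applying it over the row indices gives one list per row.
as.list(df[1, ])                                                  # first row as a named list
rows <- lapply(seq_len(nrow(df)), function(i) as.list(df[i, ]))   # every row as a list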

Using numerical columns to name non-numerical columns (idk if that makes sense)

I have a table that includes 2 columns. One has values from 1-12 and the other is all NA. I would like to write code so that, if a row contains the numbers 1,2,3,4,11,12 in the numerical column, the other column reads "Winter". If a row contains the numbers 5,6,7,8, "Summer", and 9,10, "Fall". How would I do this in R?
Try this
df <- data.frame(x = 1:12 , y = NA)
df
#> x y
#> 1 1 NA
#> 2 2 NA
#> 3 3 NA
#> 4 4 NA
#> 5 5 NA
#> 6 6 NA
#> 7 7 NA
#> 8 8 NA
#> 9 9 NA
#> 10 10 NA
#> 11 11 NA
#> 12 12 NA
df$y <- ifelse(df$x %in% c(1, 2, 3, 4, 11, 12), "Winter",
               ifelse(df$x %in% c(9, 10), "Fall", "Summer"))
df
#> x y
#> 1 1 Winter
#> 2 2 Winter
#> 3 3 Winter
#> 4 4 Winter
#> 5 5 Summer
#> 6 6 Summer
#> 7 7 Summer
#> 8 8 Summer
#> 9 9 Fall
#> 10 10 Fall
#> 11 11 Winter
#> 12 12 Winter
Created on 2022-06-15 by the reprex package (v2.0.1)
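
If you are already using dplyr, case_when() expresses the same mapping a little more readably. A sketch, equivalent to the ifelse() version above:
library(dplyr)
df$y <- case_when(
  df$x %in% c(1:4, 11, 12) ~ "Winter",
  df$x %in% 5:8            ~ "Summer",
  df$x %in% 9:10           ~ "Fall"
)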

Removing groups with all NA in Data.Table or DPLYR in R

dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataWANT=data.frame("student"=c(1,1,1,3,3,3,5,5,5),
"time"=c(1,2,3,1,2,3,NA,2,3),
"score"=c(7,9,5,NA,3,9,7,NA,5))
I have a tall data frame, and I want to remove student IDs whose 'score' values are all NA or whose 'time' values are all NA. This only applies when every value is NA; if only some values are NA, I want to keep all of that student's records...
Is this what you want?
library(dplyr)
dataHAVE %>%
  group_by(student) %>%
  filter(!all(is.na(score)))
student time score
<dbl> <dbl> <dbl>
1 1 1 7
2 1 2 9
3 1 3 5
4 3 1 NA
5 3 2 3
6 3 3 9
7 5 NA 7
8 5 2 NA
9 5 3 5
Each student is only kept if not (!) all score values are NA
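If you also want to drop students whose time values are all NA (as the question describes), the same filter extends to both columns. A sketch, using the dataHAVE frame above:
library(dplyr)
dataHAVE %>%
  group_by(student) %>%
  filter(!all(is.na(score)), !all(is.na(time))) %>%
  ungroup()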
Since nobody suggested one, here is a solution using data.table:
library(data.table)
dataHAVE = data.table("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
Edit:
Previous but wrong code:
dataHAVE[, .SD[!(all(is.na(time)) & all(is.na(score)))], by = student]
New and correct code:
dataHAVE[, .SD[!(all(is.na(time)) | all(is.na(score)))], by = student]
Returns:
student time score
1: 1 1 7
2: 1 2 9
3: 1 3 5
4: 3 1 NA
5: 3 2 3
6: 3 3 9
7: 5 NA 7
8: 5 2 NA
9: 5 3 5
Edit:
Updated the data.table solution with @Cole's suggestion...
Here is a base R solution using subset + ave
dataWANT <- subset(dataHAVE,
                   !(ave(time, student, FUN = function(v) all(is.na(v))) |
                     ave(score, student, FUN = function(v) all(is.na(v)))))
or
dataWANT <- subset(dataHAVE,
!Reduce(`|`,Map(function(x) ave(get(x),student,FUN = function(v) all(is.na(v))), c("time","score"))))
Another option:
library(data.table)
setDT(dataHAVE, key="student")
dataHAVE[!student %in% dataHAVE[, if(any(colSums(is.na(.SD))==.N)) student, student]$V1]
Create a dummy variable, and filter based on that
library("dplyr")
dataHAVE = data.frame("student"=c(1,1,1,2,2,2,3,3,3,4,4,4,5,5,5),
"time"=c(1,2,3,1,2,3,1,2,3,NA,NA,NA,NA,2,3),
"score"=c(7,9,5,NA,NA,NA,NA,3,9,NA,NA,NA,7,NA,5))
dataHAVE %>%
  mutate(check = is.na(time) & is.na(score)) %>%
  filter(check == FALSE) %>%
  select(-check)
#> student time score
#> 1 1 1 7
#> 2 1 2 9
#> 3 1 3 5
#> 4 2 1 NA
#> 5 2 2 NA
#> 6 2 3 NA
#> 7 3 1 NA
#> 8 3 2 3
#> 9 3 3 9
#> 10 5 NA 7
#> 11 5 2 NA
#> 12 5 3 5
Created on 2020-02-21 by the reprex package (v0.3.0)
data.table solution generalising to any number of columns:
dataHAVE[,
         .SD[do.call("+", lapply(.SD, function(x) any(!is.na(x)))) == ncol(.SD)],
         by = student]
# student time score
# 1: 1 1 7
# 2: 1 2 9
# 3: 1 3 5
# 4: 3 1 NA
# 5: 3 2 3
# 6: 3 3 9
# 7: 5 NA 7
# 8: 5 2 NA
# 9: 5 3 5
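
For comparison, a dplyr version that also generalises to several columns. A sketch, assuming dplyr >= 1.0 for if_any(); a group is dropped as soon as any of the listed columns is entirely NA:
library(dplyr)
dataHAVE %>%
  group_by(student) %>%
  filter(!if_any(c(time, score), ~ all(is.na(.x)))) %>%
  ungroup()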

How to "extrapolate" values of panel data in R?

I have a panel data with NA values like below:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 NA
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 NA
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
I would like to perform a linear interpolation, so I wrote this code:
library(dplyr)
library(zoo)
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value = na.approx(value, na.rm = FALSE))
then I get the output:
uid year month day value
1 1 2016 8 1 NA
2 1 2016 8 2 NA
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 NA
10 2 2016 8 5 NA
Here the approx method interpolates NA values successfully but does not extrapolate.
Is there any good way to replace the value of the 1st and 2nd rows with the first non-NA value for this user (30)? Similarly, how can I replace the value of the 9th and 10th rows with the last non-NA value for this user (50)?
One way to do this is by using na.spline() from the same zoo package:
panel_df <- group_by(panel_df, uid)
panel_df <- mutate(panel_df, value=na.spline(value))
panel_df
Source: local data frame [10 x 5]
Groups: uid [2]
uid year month day value
<int> <int> <int> <int> <dbl>
1 1 2016 8 1 40
2 1 2016 8 2 35
3 1 2016 8 3 30
4 1 2016 8 4 25
5 1 2016 8 5 20
6 2 2016 8 1 40
7 2 2016 8 2 45
8 2 2016 8 3 50
9 2 2016 8 4 55
10 2 2016 8 5 60
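
If you specifically want the leading and trailing NAs filled with the nearest observed value (30 and 50) rather than spline extrapolation, na.approx() can do that as well. A sketch, passing rule = 2 through to approx() so the ends are extended with the closest non-NA value:
library(dplyr)
library(zoo)
panel_df <- panel_df %>%
  group_by(uid) %>%
  mutate(value = na.approx(value, na.rm = FALSE, rule = 2))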