debugging: function to create multiple lags for multiple columns (dplyr) - r

I want to create multiple lags of multiple variables, so I thought writing a function would be helpful. My code throws a warning ("Truncating vector to length 1") and produces incorrect results:
library(dplyr)
time <- c(2000:2009, 2000:2009)
x <- c(1:10, 10:19)
id <- c(1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,2,2)
df <- data.frame(id, time, x)
three_lags <- function(data, column, group, ordervar) {
  data <- data %>%
    group_by_(group) %>%
    mutate(a = lag(column, 1L, NA, order_by = ordervar),
           b = lag(column, 2L, NA, order_by = ordervar),
           c = lag(column, 3L, NA, order_by = ordervar))
}
df_lags <- three_lags(data = df, column = x, group = id, ordervar = time) %>%
  arrange(id, time)
Also, I wondered if there might be a more elegant solution using mutate_each, but I didn't get that to work either. I could of course write a long block of code with a line for each new lagged variable, but I'd like to avoid that.
EDIT:
akrun's dplyr answer works, but takes a long time to compute for large data frames. The solution using data.table seems to be more efficient, so a dplyr or other solution that can also be applied to several columns and several lags is still to be found.
EDIT 2:
For multiple columns and no groups (e.g. "ID"), the following solution seems very well suited to me due to its simplicity. The code may of course be shortened, but step by step:
library(data.table)  # shift() comes from data.table
df <- arrange(df, time)
df.lag <- shift(df[, 1:24], n = 1:3, give.names = T)  # column indexes of the columns to be lagged as "[, startcol:endcol]"; "n = 1:3" specifies the number of lags (lag1, lag2 and lag3 in this case)
df.result <- bind_cols(df, df.lag)

We can use shift from data.table, which can take multiple values for 'n':
library(data.table)
setDT(df)[order(time), c("a", "b", "c") := shift(x, 1:3), id][order(id, time)]
Suppose we need to do this on multiple columns:
df$y <- df$x
setDT(df)[order(time), paste0(rep(c("x", "y"), each = 3),
          c("a", "b", "c")) := shift(.SD, 1:3), id, .SDcols = x:y]
shift can also be used within dplyr:
library(dplyr)
df %>%
  group_by(id) %>%
  arrange(id, time) %>%
  do(data.frame(., setNames(shift(.$x, 1:3), c("a", "b", "c"))))
#      id  time     x     a     b     c
#   <dbl> <int> <int> <int> <int> <int>
# 1     1  2000     1    NA    NA    NA
# 2     1  2001     2     1    NA    NA
# 3     1  2002     3     2     1    NA
# 4     1  2003     4     3     2     1
# 5     1  2004     5     4     3     2
# 6     1  2005     6     5     4     3
# 7     1  2006     7     6     5     4
# 8     1  2007     8     7     6     5
# 9     1  2008     9     8     7     6
#10     1  2009    10     9     8     7
#11     2  2000    10    NA    NA    NA
#12     2  2001    11    10    NA    NA
#13     2  2002    12    11    10    NA
#14     2  2003    13    12    11    10
#15     2  2004    14    13    12    11
#16     2  2005    15    14    13    12
#17     2  2006    16    15    14    13
#18     2  2007    17    16    15    14
#19     2  2008    18    17    16    15
#20     2  2009    19    18    17    16
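For completeness, the original three_lags() function can also be fixed directly. Its arguments are evaluated as the free-standing vectors x, id and time rather than as columns of each group, which is what triggers the truncation warning and the wrong results. A sketch of a tidy-evaluation rewrite, assuming dplyr >= 1.0 where the {{ }} (embrace) operator is available:
library(dplyr)

three_lags <- function(data, column, group, ordervar) {
  data %>%
    group_by({{ group }}) %>%
    mutate(a = lag({{ column }}, 1L, order_by = {{ ordervar }}),
           b = lag({{ column }}, 2L, order_by = {{ ordervar }}),
           c = lag({{ column }}, 3L, order_by = {{ ordervar }})) %>%
    ungroup()
}

df_lags <- three_lags(df, column = x, group = id, ordervar = time) %>%
  arrange(id, time)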

You could also create a function that outputs a tibble:
library(tidyverse)
lag_multiple <- function(x, n_vec){
  map(n_vec, lag, x = x) %>%
    set_names(paste0("lag", n_vec)) %>%
    as_tibble()
}
tibble(x = 1:30) %>%
  mutate(lag_multiple(x, 1:5))
#> # A tibble: 30 x 6
#>        x  lag1  lag2  lag3  lag4  lag5
#>    <int> <int> <int> <int> <int> <int>
#>  1     1    NA    NA    NA    NA    NA
#>  2     2     1    NA    NA    NA    NA
#>  3     3     2     1    NA    NA    NA
#>  4     4     3     2     1    NA    NA
#>  5     5     4     3     2     1    NA
#>  6     6     5     4     3     2     1
#>  7     7     6     5     4     3     2
#>  8     8     7     6     5     4     3
#>  9     9     8     7     6     5     4
#> 10    10     9     8     7     6     5
#> # ... with 20 more rows
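Applied to the grouped data from the question, this could look something like the following (a sketch; it relies on mutate() splicing an unnamed data-frame result into columns, which works in dplyr >= 1.0):
df %>%
  arrange(id, time) %>%
  group_by(id) %>%
  mutate(lag_multiple(x, 1:3)) %>%
  ungroup()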


Is there a way to group values in a column between data gaps in R?

I want to group my data into different chunks wherever the data is continuous. I am trying to get the group column from dummy data like this:
       a     b group
   <dbl> <dbl> <dbl>
 1     1     1     1
 2     2     2     1
 3     3     3     1
 4     4    NA    NA
 5     5    NA    NA
 6     6    NA    NA
 7     7    12     2
 8     8    15     2
 9     9    NA    NA
10    10    25     3
I tried using
test %>%
  mutate(test = complete.cases(.)) %>%
  group_by(group = cumsum(test == TRUE)) %>%
  select(group, everything())
But it doesn't work as expected:
   group     a     b test
   <int> <dbl> <dbl> <lgl>
 1     1     1     1 TRUE
 2     2     2     2 TRUE
 3     3     3     3 TRUE
 4     3     4    NA FALSE
 5     3     5    NA FALSE
 6     3     6    NA FALSE
 7     4     7    12 TRUE
 8     5     8    15 TRUE
 9     5     9    NA FALSE
10     6    10    25 TRUE
Any advice?
Using rle in base R: cumsum(values) advances only on the runs of non-NA values, and rep(..., lengths) stretches those counts back to the full column length; the NA rows are then blanked out in the second step.
transform(df, group1 = with(rle(!is.na(b)), rep(cumsum(values), lengths))) |>
transform(group1 = replace(group1, is.na(b), NA))
#    a  b group group1
#1   1  1     1      1
#2   2  2     1      1
#3   3  3     1      1
#4   4 NA    NA     NA
#5   5 NA    NA     NA
#6   6 NA    NA     NA
#7   7 12     2      2
#8   8 15     2      2
#9   9 NA    NA     NA
#10 10 25     3      3
A couple of approaches to consider if you wish to use dplyr for this.
First, you could look at transitions from non-complete cases to complete cases (using lag).
library(dplyr)
test %>%
  mutate(test = complete.cases(.)) %>%
  group_by(group = cumsum(test & !lag(test, default = F))) %>%
  mutate(group = replace(group, !test, NA))
Alternatively, you could add row numbers to your data.frame, filter to include only complete cases, and group_by enumerating with cumsum based on gaps in row numbers. Then, join back to the original data.
test$rn <- seq.int(nrow(test))
test %>%
  filter(complete.cases(.)) %>%
  group_by(group = c(0, cumsum(diff(rn) > 1)) + 1) %>%
  right_join(test) %>%
  arrange(rn) %>%
  dplyr::select(-rn)
Output
       a     b group
   <int> <int> <dbl>
 1     1     1     1
 2     2     2     1
 3     3     3     1
 4     4    NA    NA
 5     5    NA    NA
 6     6    NA    NA
 7     7    12     2
 8     8    15     2
 9     9    NA    NA
10    10    25     3
Using data.table: get run ids with rleid, then remove the group IDs for NA rows, then fix the numbering with a factor-to-integer conversion:
library(data.table)
setDT(test)[, group1 := {
  x <- complete.cases(test)
  grp <- rleid(x)
  grp[!x] <- NA
  as.integer(factor(grp))
}]
#      a  b group group1
#  1:  1  1     1      1
#  2:  2  2     1      1
#  3:  3  3     1      1
#  4:  4 NA    NA     NA
#  5:  5 NA    NA     NA
#  6:  6 NA    NA     NA
#  7:  7 12     2      2
#  8:  8 15     2      2
#  9:  9 NA    NA     NA
# 10: 10 25     3      3
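For anyone on dplyr >= 1.1.0, a dplyr analogue of this rleid idea is consecutive_id(); a sketch:
library(dplyr)  # >= 1.1.0 for consecutive_id()
test %>%
  mutate(run = consecutive_id(is.na(b)),        # new id at every NA/non-NA boundary
         run = replace(run, is.na(b), NA),      # blank out the NA runs
         group1 = as.integer(factor(run))) %>%  # renumber the remaining runs 1, 2, 3, ...
  select(-run)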

Update a variable if dplyr filter conditions are met

With the command df %>% filter(is.na(df)[,2:4]), filter subsets into a new df the rows that have NAs in columns 2, 3 and 4. What I want is not a new subsetted df, but rather to assign, for example, 1 to a new variable called "Exclude" in the actual df.
This example with mutate was not exactly what I was looking for, but close:
Use dplyr's filter and mutate to generate a new variable
Also I would need the same to happen with other filter conditions.
For example, I have the following:
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3,2:4] <- NA
df[5,2:4] <- NA
df
> df
  A  B  C  D
1 1 11 21 31
2 2 12 22 32
3 3 NA NA NA
4 4 14 24 34
5 5 NA NA NA
6 6 16 26 36
and would like
> df
  A  B  C  D Exclude
1 1 11 21 31      NA
2 2 12 22 32      NA
3 3 NA NA NA       1
4 4 14 24 34      NA
5 5 NA NA NA       1
6 6 16 26 36      NA
Any good ideas how the filter subset could be used to update easily? The hard workaround would be to generate this subset, create the new variable for all rows, and then join back, but that is not tidy code.
We can do this with base R using vectorized rowSums: rows containing at least one NA have a positive rowSums(is.na(df[-1])), so its negation is FALSE there, and NA^FALSE evaluates to 1 while NA^TRUE stays NA.
df$Exclude <- NA^!rowSums(is.na(df[-1]))
Output:
df
#  A  B  C  D Exclude
#1 1 11 21 31      NA
#2 2 12 22 32      NA
#3 3 NA NA NA       1
#4 4 14 24 34      NA
#5 5 NA NA NA       1
#6 6 16 26 36      NA
Does this work?
library(dplyr)
df %>%
  rowwise() %>%
  mutate(Exclude = +any(is.na(c_across(everything()))),
         Exclude = na_if(Exclude, 0))
# A tibble: 6 x 5
# Rowwise:
      A     B     C     D Exclude
  <int> <int> <int> <int>   <int>
1     1    11    21    31      NA
2     2    12    22    32      NA
3     3    NA    NA    NA       1
4     4    14    24    34      NA
5     5    NA    NA    NA       1
6     6    16    26    36      NA
Using anyNA.
df %>% mutate(Exclude=ifelse(apply(df[2:4], 1, anyNA), 1, NA))
#   A  B  C  D Exclude
# 1 1 11 21 31      NA
# 2 2 12 22 32      NA
# 3 3 NA NA NA       1
# 4 4 14 24 34      NA
# 5 5 NA NA NA       1
# 6 6 16 26 36      NA
Or just
df$Exclude <- ifelse(apply(df[2:4], 1, anyNA), 1, NA)
Another one-line solution:
df$Exclude <- as.numeric(apply(df[2:4], 1, function(x) any(is.na(x))))
Use rowwise, sum over all numeric columns, and assign 1 or NA in ifelse; the sum is NA whenever any value in the row is NA.
df <- data.frame(A = 1:6, B = 11:16, C = 21:26, D = 31:36)
df[3, 2:4] <- NA
df[5, 2:4] <- NA
library(tidyverse)
df %>%
  rowwise() %>%
  mutate(Exclude = ifelse(
    is.na(sum(c_across(where(is.numeric)))), 1, NA
  ))
#> # A tibble: 6 x 5
#> # Rowwise:
#>       A     B     C     D Exclude
#>   <int> <int> <int> <int>   <dbl>
#> 1     1    11    21    31      NA
#> 2     2    12    22    32      NA
#> 3     3    NA    NA    NA       1
#> 4     4    14    24    34      NA
#> 5     5    NA    NA    NA       1
#> 6     6    16    26    36      NA
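A vectorized alternative that avoids rowwise() altogether would be if_any(), which needs dplyr >= 1.0.4; a sketch:
library(dplyr)
df %>%
  mutate(Exclude = if_else(if_any(B:D, is.na), 1, NA_real_))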

replacing missing values with non-missing values in grouped data using tidyverse

For each id, I am trying to replace missing values with data that is available.
library(tidyverse)
df <- data.frame(id = c(1, 1, 1, 2, 2, 2, 3),
                 a = c(NA, NA, 10, NA, 12, NA, 10),
                 b = c(10, NA, NA, NA, 13, NA, NA))
> df
  id  a  b
1  1 NA 10
2  1 NA NA
3  1 10 NA
4  2 NA NA
5  2 12 13
6  2 NA NA
7  3 10 NA
I have tried:
df %>%
  dplyr::group_by(id) %>%
  dplyr::mutate_at(vars(a:b), fill(., direction = "up"))
and get the following error:
Error: 1 components of `...` had unexpected names.
We detected these problematic arguments:
* `direction`
Did you misspecify an argument?
Desired output:
  id  a  b
1  1 10 10
2  1 10 NA
3  1 10 NA
4  2 12 13
5  2 12 13
6  2 12 13
7  3 10 NA
We don't use fill with mutate_at. According to ?fill:
data - A data frame.
... - A selection of columns. If empty, nothing happens. You can supply bare variable names, select all variables between x and z with x:z, exclude y with -y.
library(dplyr)
library(tidyr)
df %>%
  group_by(id) %>%
  fill(a:b, .direction = 'up')
# A tibble: 7 x 3
# Groups:   id [3]
#     id     a     b
#  <dbl> <dbl> <dbl>
#1     1    10    10
#2     1    10    NA
#3     1    10    NA
#4     2    12    13
#5     2    12    13
#6     2    NA    NA
#7     3    10    NA
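Note that with .direction = 'up' the trailing NA row of group 2 (row 6) stays NA, which differs from the desired output. Filling upward and then downward should match it exactly:
df %>%
  group_by(id) %>%
  fill(a:b, .direction = 'updown')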

Generating multiple columns to sort the data out in R

I have a database including names, codes and rooms as follows:
Name1 Code1 R1
A A 12 1
A B 13 2
A C 15 5
A B 8 4
A C 13 2
A D 17 1
A B 16 7
I want to generate columns for the repeated names like this:
Name1 Code1 R1 Name2 Code2 R2 Name3 Code3 R3
A A 12 1
A B 13 2
A C 15 5
A B 8 4 A B 8 4
A C 13 2 A C 13 2
A D 17 1
A B 16 7 A B 16 7
I have googled for a solution, but I could not find one, or maybe I have missed something. Would it be possible for you to help me? Some names (Name1) are repeated up to five times, which I did not show above, so I would also have Name2, Code2, R2; Name3, Code3, R3; and so on.
Sample data:
df <- read.table(stringsAsFactors = F, header = T, text = "
  Name1a Name1b Code1 R1
1 A      A      12    1
2 A      B      13    2
3 A      C      15    5
4 A      B      8     4
5 A      C      13    2
6 A      D      17    1
7 A      B      16    7") %>%
  tidyr::unite(Name1, Name1a, Name1b)
Edit: the original answer was in packed format, but the OP would like the first set of columns repeated for all rows, with second and third appearances showing up in the row where they originally appeared.
Here's an approach using dplyr and tidyr.
# Keep track of original rows, label repeats, and make it long format
df_order <- df %>%
  mutate(orig_row = row_number()) %>%
  group_by(Name1) %>%
  mutate(repeat_no = row_number()) %>%
  ungroup() %>%
  gather(col_type, value, Code1:R1)

# Make one copy of all the rows to keep in the first set of columns
df_ones <- df_order %>%
  mutate(repeat_no = 1) %>%
  unite(col_rpt, repeat_no, col_type)

# Get the repeated rows to add on
df_repeats <- df_order %>%
  filter(repeat_no > 1) %>%
  unite(col_rpt, repeat_no, col_type)

# Combine the two and spread out
output <- df_ones %>%
  bind_rows(df_repeats) %>%
  spread(col_rpt, value) %>%
  arrange(orig_row) %>%
  select(-orig_row)
Output:
> output
# A tibble: 7 x 7
  Name1 `1_Code1` `1_R1` `2_Code1` `2_R1` `3_Code1` `3_R1`
  <chr>     <int>  <int>     <int>  <int>     <int>  <int>
1 A_A          12      1        NA     NA        NA     NA
2 A_B          13      2        NA     NA        NA     NA
3 A_C          15      5        NA     NA        NA     NA
4 A_B           8      4         8      4        NA     NA
5 A_C          13      2        13      2        NA     NA
6 A_D          17      1        NA     NA        NA     NA
7 A_B          16      7        NA     NA        16      7
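gather() and spread() are superseded; under tidyr >= 1.0 the same reshape can be sketched with pivot_longer() and pivot_wider(), where names_glue reproduces the 1_Code1-style column names:
library(dplyr)
library(tidyr)

df_long <- df %>%
  mutate(orig_row = row_number()) %>%
  group_by(Name1) %>%
  mutate(repeat_no = row_number()) %>%
  ungroup() %>%
  pivot_longer(c(Code1, R1), names_to = "col_type", values_to = "value")

output <- df_long %>%
  mutate(repeat_no = 1) %>%                    # the copy kept in the first set of columns
  bind_rows(filter(df_long, repeat_no > 1)) %>%
  pivot_wider(names_from = c(repeat_no, col_type), values_from = value,
              names_glue = "{repeat_no}_{col_type}") %>%
  arrange(orig_row) %>%
  select(-orig_row)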

Filtering data relative to first and last occurrence of an event

I have a dataframe from an experiment in which stimuli are shown to participants and time is measured continuously.
# reprex
library(tidyverse)
df <-
  tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
         time = 0:12)
# A tibble: 13 x 2
   stim   time
   <chr> <int>
 1 NA        0
 2 NA        1
 3 NA        2
 4 NA        3
 5 a         4
 6 b         5
 7 NA        6
 8 c         7
 9 NA        8
10 d         9
11 NA       10
12 NA       11
13 NA       12
I want to create a generalized solution using tidyverse functions that keeps only the data from 1 second before the first marker to 2 seconds after the last marker. Using tidyverse, I thought this would work, but it throws an uninformative error.
df %>%
  # store times for first and last stim
  mutate(first_stim = drop_na(stim) %>% pull(time) %>% first(),
         last_stim = drop_na(stim) %>% pull(time) %>% last()) %>%
  # filter df based on new variables
  filter(time >= first(first_stim) - 1 &
         time <= first(last_stim) + 2)
Error in mutate_impl(.data, dots) : bad value
So I made some pretty ugly base R code to overcome this issue by changing the mutate:
df2 <- df %>%
  mutate(first_stim = .[!is.na(.$stim), "time"][1, 1],
         last_stim = .[!is.na(.$stim), "time"][nrow(.[!is.na(.$stim), "time"]), 1])
# A tibble: 13 x 4
   stim   time first_stim last_stim
   <chr> <int>   <tibble>  <tibble>
 1 NA        0          4         9
 2 NA        1          4         9
 3 NA        2          4         9
 4 NA        3          4         9
 5 a         4          4         9
 6 b         5          4         9
 7 NA        6          4         9
 8 c         7          4         9
 9 NA        8          4         9
10 d         9          4         9
11 NA       10          4         9
12 NA       11          4         9
13 NA       12          4         9
Now I would only need to filter based on the new variables first_stim - 1 and last_stim + 2. But filter fails too:
df2 %>%
  filter(time >= first(first_stim) - 1 &
         time <= first(last_stim) + 2)
Error in filter_impl(.data, quo) :
Not compatible with STRSXP: [type=NULL].
I was able to do it in base R, but it is really ugly:
df2[(df2$time >= (df2[[1, "first_stim"]] - 1)) &
    (df2$time <= (df2[[1, "last_stim"]] + 2)), ]
The desired output should look like this:
# A tibble: 9 x 2
   stim   time
   <chr> <int>
 4 NA        3
 5 a         4
 6 b         5
 7 NA        6
 8 c         7
 9 NA        8
10 d         9
11 NA       10
12 NA       11
I believe the errors are related to dplyr::nth() and related functions. I've found some old issues describing this behavior, which should no longer exist: https://github.com/tidyverse/dplyr/issues/1980
I would really appreciate if someone could highlight what is the problem, and how to do this in a tidy way.
You could use a combination of is.na and which...
library(dplyr)
df <-
tibble(stim = c(NA, NA, NA, NA, "a", "b", NA, "c", NA, "d", NA, NA, NA),
time = 0:12)
df %>%
  filter(row_number() >= first(which(!is.na(stim))) - 1 &
         row_number() <= last(which(!is.na(stim))) + 2)
# # A tibble: 9 x 2
#   stim   time
#   <chr> <int>
# 1 NA        3
# 2 a         4
# 3 b         5
# 4 NA        6
# 5 c         7
# 6 NA        8
# 7 d         9
# 8 NA       10
# 9 NA       11
you could also make your first attempt work with a little modification: drop_na() expects a data frame as its first argument, so pass the whole . (the piped data) rather than the bare stim column...
df %>%
  mutate(first_stim = first(drop_na(., stim) %>% pull(time)),
         last_stim = last(drop_na(., stim) %>% pull(time))) %>%
  filter(time >= first(first_stim) - 1 &
         time <= first(last_stim) + 2)
We can create a cumulative sum of non-NA values and then find the row indices where we encounter the first non-NA value and the last one. We then select rows based on the requirement (-1 from the start and +2 from the end).
library(tidyverse)
df %>%
mutate(count_cumsum = cumsum(!is.na(stim))) %>%
slice((which.max(count_cumsum == 1) -1):(which.max(count_cumsum) + 2)) %>%
select(-count_cumsum)
#  stim   time
#  <chr> <int>
#1 NA        3
#2 a         4
#3 b         5
#4 NA        6
#5 c         7
#6 NA        8
#7 d         9
#8 NA       10
#9 NA       11
Just to give an idea of how count_cumsum looks:
df %>%
  mutate(count_cumsum = cumsum(!is.na(stim)))
# A tibble: 13 x 3
#    stim   time count_cumsum
#    <chr> <int>        <int>
# 1  NA        0            0
# 2  NA        1            0
# 3  NA        2            0
# 4  NA        3            0
# 5  a         4            1
# 6  b         5            2
# 7  NA        6            2
# 8  c         7            3
# 9  NA        8            3
#10  d         9            4
#11  NA       10            4
#12  NA       11            4
#13  NA       12            4
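For completeness, a compact dplyr variant that filters on time directly rather than on row numbers (a sketch; it assumes at least one non-NA stim):
library(dplyr)
df %>%
  filter(between(time,
                 min(time[!is.na(stim)]) - 1,   # 1 s before the first marker
                 max(time[!is.na(stim)]) + 2))  # 2 s after the last marker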
