Lag/lead an entire dataframe in R

I am having a very hard time leading or lagging an entire dataframe. I am able to shift individual columns with the following attempts, but not the whole thing:
require('DataCombine')
df_l <- slide(df, Var = var1, slideBy = -1)
Using colnames(x_ret_mon) as Var does not work; I am told the variable names are not found in the dataframe.
This attempt shifts the columns right but not down:
df_l<- dplyr::lag(df)
This only creates new variables for the lagged variables, and I do not know how to effectively delete the old, non-lagged values:
df_l<-shift(df, n=1L, fill=NA, type=c("lead"), give.names=FALSE)

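Use dplyr::mutate_all to apply lags or leads to all columns. Pass dplyr::lag explicitly: if dplyr is not attached, a bare lag resolves to stats::lag, which does not shift the values.
df = data.frame(a = 1:10, b = 21:30)
dplyr::mutate_all(df, dplyr::lag)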
a b
1 NA NA
2 1 21
3 2 22
4 3 23
5 4 24
6 5 25
7 6 26
8 7 27
9 8 28
10 9 29
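As a hedged sketch of the same idea with the newer across() interface (mutate_all is superseded in current dplyr), the following lags and leads every column by one row. The names df_lag and df_lead are illustrative and not from the original post.

```r
library(dplyr)

df <- data.frame(a = 1:10, b = 21:30)

# Lag every column by one row: the first row becomes NA
df_lag <- df %>% mutate(across(everything(), ~ dplyr::lag(.x, n = 1)))

# Lead every column by one row: the last row becomes NA
df_lead <- df %>% mutate(across(everything(), ~ dplyr::lead(.x, n = 1)))
```

Changing n shifts by more than one row.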

I don't see the point in lagging all columns in a data.frame. Wouldn't that just correspond to rbinding an NA row to your original data.frame (minus its last row)?
df = data.frame(a = 1:10, b = 21:30)
rbind(NA, df[-nrow(df), ])
# a b
#1 NA NA
#2 1 21
#3 2 22
#4 3 23
#5 4 24
#6 5 25
#7 6 26
#8 7 27
#9 8 28
#10 9 29
And similarly for leading all columns.
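For completeness, a minimal base R sketch of the lead case: drop the first row and append an NA row. The name df_lead is illustrative, and resetting the row names is an optional tidy-up rather than part of the original answer.

```r
df <- data.frame(a = 1:10, b = 21:30)

# Lead all columns by one row: remove the first row, then append a row of NAs
df_lead <- rbind(df[-1, ], NA)

# Renumber the rows from 1 so they no longer carry the original labels
rownames(df_lead) <- NULL
```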

A couple more options (lag here means dplyr::lag, not stats::lag):
data.frame(lapply(df, dplyr::lag))
require(purrr)
map_df(df, dplyr::lag)
If your data is a data.table, you can do
require(data.table)
as.data.table(shift(df))
Or, if you're overwriting df
df[] <- lapply(df, dplyr::lag) # Thanks Moody
require(magrittr)
df %<>% map_df(dplyr::lag)
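The question also asks how to avoid keeping the old, unshifted columns when using data.table's shift. One possible sketch is to overwrite the columns in place with :=, so no extra variables are created. The names dt, cols and dt_lead are illustrative assumptions.

```r
library(data.table)

dt <- data.table(a = 1:10, b = 21:30)
cols <- names(dt)

# Work on a copy so dt itself is left untouched, then overwrite
# every column with its one-row lead (by reference on the copy).
dt_lead <- copy(dt)[, (cols) := shift(.SD, n = 1L, fill = NA, type = "lead"),
                    .SDcols = cols]
```

Using type = "lag" instead gives the lagged table.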

Related

Keep the row if the specific column is the minimum value of that row

I cannot share the dataset but I will explain it as best as I can.
The dataset has 50 columns, 48 of which are in Y/m/d h:m:s format. The data also has many NAs, but they must not be removed.
Let's say there is a column B. I want to remove a row if the value of B is not the earliest in that row.
How can I do this in R? For example, the original would be like this:
df <- data.frame(
A = c(11,19,17,6,13),
B = c(18,9,5,16,12),
C = c(14,15,8,87,16))
A B C
1 11 18 14
2 19 9 15
3 17 5 8
4 6 16 87
5 13 12 16
but I want this:
A B C
1 19 9 15
2 17 5 8
3 13 12 16
You could use apply() to find the minimum for each row.
df |> subset(B == apply(df, 1, min, na.rm = TRUE))
# A B C
# 2 19 9 15
# 3 17 5 8
# 5 13 12 16
The tidyverse equivalent (pmap_dbl returns a numeric vector rather than a list) is
library(tidyverse)
df %>% filter(B == pmap_dbl(across(A:C), min, na.rm = TRUE))
If you are willing to use data.table, you could do the following for the example.
library(data.table)
setDT(df)
df[(B < A & B < C)]
A B C
1: 19 9 15
2: 17 5 8
3: 13 12 16
More generally, you could do
df <- as.data.table(df)
df[, min := do.call(pmin, .SD)][B == min, !"min"]
.SDcols in the first [ lets you control which columns the minimum is taken over, e.g. if you want to exclude some. I am not super knowledgeable about the inner workings of data.table, but since := adds the min column by reference, adding it should not copy the table, so this is probably efficient RAM-wise.
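Since the question stresses that the data contains many NAs, here is a hedged base R sketch that ignores NAs when taking the row minimum and drops rows where B itself is NA. The numeric values are made up for illustration, and row_min and keep are hypothetical names.

```r
# Illustrative data with missing values (not from the original post)
df <- data.frame(
  A = c(11, 19, NA, 6, 13),
  B = c(18, 9, 5, 16, NA),
  C = c(14, NA, 8, 87, 16)
)

# Row-wise minimum across all columns, ignoring NAs
row_min <- do.call(pmin, c(df, na.rm = TRUE))

# Keep rows where B is present and equals the row minimum
keep <- !is.na(df$B) & df$B == row_min
result <- df[keep, ]
```

To restrict the comparison to some columns, pass df[cols] instead of df to pmin.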

Is there a better way to combine these dataframes and match values?

I have two dataframes that I have combined using left_join()
data1 can be simplified to something like...
Date <- as.Date(c('2011-7-26','2011-7-26','2010-11-1','2010-11-1','2009-5-10','2009-5-10','2008-3-25','2008-3-25','2007-3-14','2007-3-14'))
Location <- c("A","B","A","B","A","B","A","B","A","B")
Result <- sample(1:30, 10)
data1 <- data.frame(Date,Location,Result)
data2 can be simplified to something like...
Date <- as.Date(c('2011-7-26','2009-5-10','2007-3-14'))
Flow_A <- c(6,2,9)
Flow_B <- c(10,11,25)
data2 <- data.frame(Date,Flow_A,Flow_B)
After combining by date, I have this
data3 <- left_join(data2, data1, by = "Date")
Date Flow_A Flow_B Location Result
1 2011-07-26 6 10 A 11
2 2011-07-26 6 10 B 17
3 2009-05-10 2 11 A 6
4 2009-05-10 2 11 B 22
5 2007-03-14 9 25 A 20
6 2007-03-14 9 25 B 1
Each value in Result corresponds to a specific Location (A or B), and I want to attach the correct value for Flow (Flow_A or Flow_B) to that row according to location (i.e. combine columns Flow_A and Flow_B into one column 'Flow' with just the correct value). I have been able to do this using a combination of mutate(), ifelse(), grepl(), and very simple functions:
a <- data3$Flow_A
Choose_A <- function(a) {
  return(a)}
b <- data3$Flow_B
Choose_B <- function(b) {
  return(b)}
data3 <- mutate(data3, Flow =
  ifelse(grepl("A", Location), Choose_A(a),
  ifelse(grepl("B", Location), Choose_B(b), NA)))
Date Flow_A Flow_B Location Result Flow
1 2011-07-26 6 10 A 11 6
2 2011-07-26 6 10 B 17 10
3 2009-05-10 2 11 A 6 2
4 2009-05-10 2 11 B 22 11
5 2007-03-14 9 25 A 20 9
6 2007-03-14 9 25 B 1 25
But this seems rather clunky. Is there a better (more efficient) way to achieve this?
Please excuse my ignorance - I'm still learning!
Thanks!
You can use match to build a vector of the column number to extract from each row, then cbind it with the row numbers into an index matrix that picks the relevant value from 'Flow_A' or 'Flow_B' depending on the Location column. Because data3 has mixed column types, matrix indexing returns character values, so convert the result back with as.numeric.
column_num <- match(paste0('Flow_', data3$Location), names(data3))
row_num <- seq_len(nrow(data3))
data3$Flow <- as.numeric(data3[cbind(row_num, column_num)])
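If only a few locations exist, a simpler alternative may be to compare Location directly inside mutate(). This is a sketch, assuming the joined data3 shown in the question (the Result values are copied from its printed output) and exact location labels "A" and "B".

```r
library(dplyr)

# data3 as printed in the question, rebuilt by hand for a self-contained example
data3 <- data.frame(
  Date = as.Date(c("2011-07-26", "2011-07-26", "2009-05-10",
                   "2009-05-10", "2007-03-14", "2007-03-14")),
  Flow_A = c(6, 6, 2, 2, 9, 9),
  Flow_B = c(10, 10, 11, 11, 25, 25),
  Location = c("A", "B", "A", "B", "A", "B"),
  Result = c(11, 17, 6, 22, 20, 1)
)

# Pick the flow column that matches each row's location;
# rows with any other location get NA
data3 <- data3 %>%
  mutate(Flow = case_when(
    Location == "A" ~ Flow_A,
    Location == "B" ~ Flow_B
  ))
```

With many locations, reshaping the Flow_* columns to long format before joining would avoid listing each case by hand.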

Adding new columns to dataframe with suffix

I want to subtract one column from another and create a new one using the corresponding suffix from the first column. I have approx. 50 columns.
I can do it "manually" as follows...
df$new1 <- df$col_a1 - df$col_b1
df$new2 <- df$col_a2 - df$col_b2
What is the easiest way to create a loop that does the job for me?
We can use grep to identify the columns that have "a" and "b" in their names and subtract them directly. This relies on the a and b columns appearing in the same order.
a_cols <- grep("col_a", names(df))
b_cols <- grep("col_b", names(df))
df[paste0("new", seq_along(a_cols))] <- df[a_cols] - df[b_cols]
df
# col_a1 col_a2 col_b1 col_b2 new1 new2
#1 10 15 1 5 9 10
#2 9 14 2 6 7 8
#3 8 13 3 7 5 6
#4 7 12 4 8 3 4
#5 6 11 5 9 1 2
#6 5 10 6 10 -1 0
data
Tested on this data
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)
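If the a and b columns might not be in matching order, one hedged variant is to pair them by suffix instead of by position. It assumes every col_a<suffix> has a col_b<suffix> partner. The variable name suffixes is illustrative.

```r
df <- data.frame(col_a1 = 10:5, col_a2 = 15:10, col_b1 = 1:6, col_b2 = 5:10)

# Extract the suffix of every "col_a" column, e.g. "1", "2"
suffixes <- sub("^col_a", "", grep("^col_a", names(df), value = TRUE))

# For each suffix, subtract the matching "col_b" column by name
for (s in suffixes) {
  df[[paste0("new", s)]] <- df[[paste0("col_a", s)]] - df[[paste0("col_b", s)]]
}
```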

R - replace all values smaller than a specific value in a column with the nearest bigger value

I have a data frame like this one:
df <- data.frame(c(1,2,3,4,5,6,7), c(0,23,55,0,1,40,21))
names(df) <- c("a", "b")
a b
1 0
2 23
3 55
4 0
5 1
6 40
7 21
Now I want to replace all values smaller than 22 in column b with the nearest bigger value. Of course it is possible to use loops, but since I have quite big datasets this is way too slow.
The solution should look somewhat like this:
a b
1 23
2 23
3 55
4 55
5 40
6 40
7 40
Here is a tidyverse possibility (but note @phiver's comment on replacement ambiguities)
library(tidyverse)
df %>%
  mutate(b = ifelse(b < 22, NA, b)) %>%
  fill(b) %>%
  fill(b, .direction = "up")
# a b
#1 1 23
#2 2 23
#3 3 55
#4 4 55
#5 5 55
#6 6 40
#7 7 40
Explanation: Replace values b < 22 with NA and then use fill to fill NAs with previous/following non-NA entries.
Sample data
df <- data.frame(a = c(1,2,3,4,5,6,7), b = c(0,23,55,0,1,40,21))
You can use zoo::rollapply:
library(zoo)
df$b <- rollapply(df$b, 3, function(x)
  if (x[2] < 22) min(x[x >= 22]) else x[2],
  partial = TRUE)
# df
# a b
# 1 1 23
# 2 2 23
# 3 3 55
# 4 4 55
# 5 5 40
# 6 6 40
# 7 7 40
In base R you could do this for the same output:
transform(df, b = sapply(seq_along(b),function(i)
if (b[i] < 22) {
bi <- c(b,Inf)[seq(i-1,i+1)]
min(bi[bi>=22])
} else b[i]))
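Because "nearest" is ambiguous, here is a hedged, vectorised sketch that makes one interpretation explicit: each value below the threshold is replaced by the value at the closest position (by row distance) that meets the threshold, with ties going to the earlier row. The function name nearest_at_least is hypothetical.

```r
nearest_at_least <- function(x, threshold) {
  ok <- which(x >= threshold)     # positions that already qualify
  small <- which(x < threshold)   # positions to replace
  if (length(ok) == 0 || length(small) == 0) return(x)

  # Index (into ok) of the closest qualifying position on the left; 0 if none
  left <- findInterval(small, ok)
  right <- left + 1               # closest on the right; length(ok) + 1 if none

  # Row distances to each neighbour, Inf where no neighbour exists
  d_left <- ifelse(left >= 1, small - ok[pmax(left, 1)], Inf)
  d_right <- ifelse(right <= length(ok), ok[pmin(right, length(ok))] - small, Inf)

  # Prefer the left neighbour on ties
  pick <- ifelse(d_left <= d_right, pmax(left, 1), pmin(right, length(ok)))
  x[small] <- x[ok[pick]]
  x
}

df <- data.frame(a = c(1, 2, 3, 4, 5, 6, 7), b = c(0, 23, 55, 0, 1, 40, 21))
df$b <- nearest_at_least(df$b, 22)
```

Unlike the windowed approaches, this does not assume a qualifying value sits directly next to each small one.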

Removing NAs when multiplying columns

This is a really simple question, but I am hoping someone will be able to help me avoid extra lines of unnecessary code. I have a simple dataframe:
Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
What I want to do is produce an extra column which is the multiplication of A, B and C, which I will then cbind to the original dataframe.
So, I would normally use:
attach(Df.1)
D<-A*B*C
But obviously, where the NAs are in column C, I get an NA in variable D. I don't want to exclude all the NA rows, but rather just ignore the NA values in this column (so the value in D would simply be the product of A and B or, where C is available, A*B*C).
I know I could simply replace the NAs with 1s, so the calculation remains unchanged, or use if statements, but I was wondering what the simplest way of doing this is.
Any ideas?
You can use prod which has an na.rm argument. To do it by row use apply:
apply(Df.1,1,prod,na.rm=TRUE)
[1] 10 60 14 120 72 36
As @James said, prod and apply will work, but you don't need to waste memory storing the result in a separate variable, or even cbinding it
Df.1$D = apply(Df.1, 1, prod, na.rm=TRUE)
Assigning the new variable in the data frame directly will work.
> Df.1 <- data.frame(A = c(5,4,7,6,8,4),B = (c(1,5,2,4,9,1)),C=(c(2,3,NA,5,NA,9)))
> Df.1
A B C
1 5 1 2
2 4 5 3
3 7 2 NA
4 6 4 5
5 8 9 NA
6 4 1 9
> Df.1$D = apply(Df.1, 1, prod, na.rm=T)
> Df.1$D
[1] 10 60 14 120 72 36
> Df.1
A B C D
1 5 1 2 10
2 4 5 3 60
3 7 2 NA 14
4 6 4 5 120
5 8 9 NA 72
6 4 1 9 36
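One caveat with applying prod over the whole data frame is that it multiplies every column present, so rerunning it after D has been added would include D as well. A hedged alternative is to name the input columns and multiply them column-wise, treating NA as 1. The variable cols is an illustrative name.

```r
Df.1 <- data.frame(A = c(5, 4, 7, 6, 8, 4),
                   B = c(1, 5, 2, 4, 9, 1),
                   C = c(2, 3, NA, 5, NA, 9))

# Only these columns take part in the product
cols <- c("A", "B", "C")

# Replace NA by 1 (the multiplicative identity) in each column,
# then multiply the columns element-wise
Df.1$D <- Reduce(`*`, lapply(Df.1[cols], function(x) ifelse(is.na(x), 1, x)))
```

This avoids the row-by-row apply call and does not need attach().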

Resources