I would like to lag multiple specific columns of a data frame in R.
Let's take this generic example. Let's assume I have defined which columns of my dataframe I need to lag:
Lag <- c(0, 1, 0, 1)
Lag.Index <- is.element(Lag, 1)
df <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)
My initial dataframe:
x1 x2 x3 x4
1 1 1 1 1
2 2 2 2 2
3 3 3 3 3
4 4 4 4 4
5 5 5 5 5
6 6 6 6 6
7 7 7 7 7
8 8 8 8 8
I would like to compute the following dataframe:
x1 x2 x3 x4
1 1 NA 1 NA
2 2 1 2 1
3 3 2 3 2
4 4 3 4 3
5 5 4 5 4
6 6 5 6 5
7 7 6 7 6
8 8 7 8 7
I know how to do it for a single lagged column, as shown here, but I cannot find an elegant way to do it for multiple lagged columns. Any help is very much appreciated.
You can use purrr's map2_dfc to lag different values by column.
purrr::map2_dfc(df, Lag, dplyr::lag)
# x1 x2 x3 x4
# <int> <int> <int> <int>
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or with data.table:
library(data.table)
setDT(df)[, names(df) := Map(shift, .SD, Lag)]
A data.table option using shift along with Vectorize
> setDT(df)[, Vectorize(shift)(.SD, Lag)]
x1 x2 x3 x4
[1,] 1 NA 1 NA
[2,] 2 1 2 1
[3,] 3 2 3 2
[4,] 4 3 4 3
[5,] 5 4 5 4
[6,] 6 5 6 5
[7,] 7 6 7 6
[8,] 8 7 8 7
Not sure whether this is elegant enough, but I would use dplyr's mutate_at function (superseded by across() since dplyr 1.0.0) to modify the selected columns:
library(dplyr)
df %>% mutate_at(.vars = vars(x2, x4), .funs = ~lag(., default = NA))
We convert Lag to a logical vector, use it to pick the corresponding column names, and apply lag to those columns with across from dplyr:
library(dplyr)
df %>%
mutate(across(names(.)[as.logical(Lag)], lag))
# x1 x2 x3 x4
#1 1 NA 1 NA
#2 2 1 2 1
#3 3 2 3 2
#4 4 3 4 3
#5 5 4 5 4
#6 6 5 6 5
#7 7 6 7 6
#8 8 7 8 7
Or we can do this in base R
df[as.logical(Lag)] <- rbind(NA, df[-nrow(df), as.logical(Lag)])
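The base R line above lags the selected columns by exactly one row. As a rough sketch of how the same idea might extend to a different lag per column, the hypothetical helper lag_vec below (not from any package) pads the top with NA and trims the tail. It assumes Lag holds non-negative whole numbers, and the Lag values used here are an illustrative variation on the question's vector.

```r
# lag_vec is a hypothetical helper, not from any package.
# It shifts x down by n positions, padding the top with NA.
lag_vec <- function(x, n) {
  if (n == 0) return(x)
  # Pad with n NAs, drop the last n values, and keep the original length
  c(rep(NA, n), head(x, -n))[seq_along(x)]
}

df  <- data.frame(x1 = 1:8, x2 = 1:8, x3 = 1:8, x4 = 1:8)
Lag <- c(0, 1, 0, 2)  # assumed example: lag x2 by 1 and x4 by 2

# Map pairs each column with its lag; df[] keeps the data.frame structure
df[] <- Map(lag_vec, df, Lag)
df$x4
# NA NA 1 2 3 4 5 6
```

Columns with a lag of 0 are returned unchanged, so the same call covers both lagged and untouched columns.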
Related
I would like to create lagged variables for several columns that are grouped by two conditions.
Here is the dataset:
df <- data.frame(id = c(rep(1,4),rep(2,4)), tp = rep(1:4,2), x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)
> df
id tp x1 x2 x3 x4
1 1 1 1 2 3 4
2 1 2 2 3 4 5
3 1 3 3 4 5 6
4 1 4 4 5 6 7
5 2 1 5 6 7 8
6 2 2 6 7 8 9
7 2 3 7 8 9 10
8 2 4 8 9 10 11
I want to lag x1, x2, x3, x4 that are grouped by id and tp and create new variables x1_lag1, x2_lag1, x3_lag1, x4_lag1, like this:
> df
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
How to achieve that?
Your result doesn't seem to be grouped by tp at all. It is grouped by id and ordered by tp within the id grouping.
Generally a "lag" is a variable that takes the value from the previous row. The columns you want labeled as "lag" columns take the value from the next row, so we use the lead function.
library(dplyr)
df %>%
group_by(id) %>%
mutate(across(starts_with("x"), lead, .names = "{.col}_lag1")) %>%
ungroup()
# A tibble: 8 × 10
id tp x1 x2 x3 x4 x1_lag1 x2_lag1 x3_lag1 x4_lag1
<dbl> <int> <int> <int> <int> <int> <int> <int> <int> <int>
1 1 1 1 2 3 4 2 3 4 5
2 1 2 2 3 4 5 3 4 5 6
3 1 3 3 4 5 6 4 5 6 7
4 1 4 4 5 6 7 NA NA NA NA
5 2 1 5 6 7 8 6 7 8 9
6 2 2 6 7 8 9 7 8 9 10
7 2 3 7 8 9 10 8 9 10 11
8 2 4 8 9 10 11 NA NA NA NA
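For comparison, the same grouped lead could be sketched in base R with ave(). The helper lead1 and the loop over column names are illustrative choices, not part of the answer above, and the sketch assumes the rows are already ordered by tp within each id.

```r
df <- data.frame(id = c(rep(1, 4), rep(2, 4)), tp = rep(1:4, 2),
                 x1 = 1:8, x2 = 2:9, x3 = 3:10, x4 = 4:11)

# lead1 is a hypothetical helper: take the next value, NA at the end
lead1 <- function(x) c(x[-1], NA)

# ave() applies lead1 separately within each id and returns a full-length vector
for (col in c("x1", "x2", "x3", "x4")) {
  df[[paste0(col, "_lag1")]] <- ave(df[[col]], df$id, FUN = lead1)
}
df$x1_lag1
# 2 3 4 NA 6 7 8 NA
```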
I have this dataframe:
> df <- data.frame(Semester = sample(1:4, 20, replace=TRUE),
X1 = sample(c(1:7,NA), 20, replace =TRUE),
X2 = sample(c(1:7,NA), 20, replace =TRUE),
X3 = sample(c(1:7,NA), 20, replace =TRUE),
X4 = sample(c(1:7,NA), 20, replace =TRUE),
X5 = sample(c(1:7,NA), 20, replace =TRUE),
X6 = sample(c(1:7,NA), 20, replace =TRUE),
X7 = sample(c(1:7,NA), 20, replace =TRUE),
stringsAsFactors = FALSE)
> df
Semester X1 X2 X3 X4 X5 X6 X7
1 4 3 7 NA NA 1 2 7
2 3 NA 3 NA 4 3 2 6
3 1 2 5 3 4 7 NA 2
4 3 1 1 6 1 3 2 4
5 1 1 2 1 3 2 6 5
6 2 1 7 1 5 2 2 6
7 4 7 6 5 2 7 1 2
8 1 5 5 7 4 5 1 5
9 1 3 1 1 5 6 3 7
10 3 6 NA 1 1 5 NA 2
11 1 1 6 6 6 3 5 7
12 3 1 5 1 2 3 1 NA
13 4 1 4 1 1 5 6 1
14 1 5 4 4 NA 5 3 3
15 2 2 NA 4 1 1 5 4
16 3 6 7 6 7 3 3 7
17 1 1 2 4 5 4 5 3
18 4 4 7 7 6 NA 4 NA
19 3 4 2 3 4 4 3 5
20 2 1 NA 3 5 7 NA 6
I'm trying to get the output below, where n_k is the number of times the value k occurs across all X* variables within a Semester. For example, n_7 for Semester==1 is the count of X* values equal to 7 in the rows where Semester is 1 (the output is only illustrative; the values are artificial).
Semester n_7 n_6 n_5 n_4 n_3 n_2 n_1
1 5 7 1 5 7 7 7
2 4 10 1 3 6 3 4
3 5 5 2 5 3 3 2
4 3 9 10 5 7 0 0
I tried by(), but it also counts the values of Semester. Is there another way to do this?
by(df, df$Semester,function(df){
count_if(eq(7), df)
count_if(eq(6), df)
count_if(eq(5), df)
count_if(eq(4), df)
count_if(eq(3), df)
count_if(eq(2), df)
count_if(eq(1), df)})
You could use a melt() followed by dcast() approach. The trailing [-9] drops the ninth column of the result, which holds the NA counts.
library(data.table)
dcast(melt(df, "Semester"), Semester ~ value, fun=length)[-9]
# Semester 1 2 3 4 5 6 7
# 1 1 5 8 10 2 7 8 4
# 2 2 8 6 7 2 5 2 5
# 3 3 2 1 4 3 2 4 5
# 4 4 1 1 3 4 7 2 8
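A base R alternative, sketched here under the assumption that every column except Semester should be counted, is to flatten the X* columns and cross-tabulate them against a repeated Semester vector. The small data frame is made up for illustration and is not the question's data.

```r
# Assumed toy data, not the data from the question
df <- data.frame(Semester = c(1, 1, 2, 2),
                 X1 = c(7, NA, 3, 7),
                 X2 = c(7, 2, 3, NA))

# unlist() stacks the X* columns one after another (column-major order)
vals <- unlist(df[-1], use.names = FALSE)
# Repeat Semester once per stacked column so it lines up with vals
sem <- rep(df$Semester, times = ncol(df) - 1)

# table() drops NA by default, so missing values are not counted
counts <- table(Semester = sem, value = vals)
counts
```

Each cell of counts gives how often a value occurs in a Semester, and Semester itself never enters the count because it is excluded before flattening.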
How to retain only unique values in each row for a data frame
The input is as below:
1 1 2 3 4 1 6 7 8
2 2 5 5 7 8 9 0 0
6 6 6 6 5 1 2 3 4
The output would be as below:
1 2 3 4 6 7 8
2 5 7 8 9
6 5 1 2 3 4
I tried plyr and unique, but they return the unique values of the complete data set rather than of each row.
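You can use sapply or lapply to accomplish it. Both iterate over the columns of a data.frame, so this assumes each line of your input is stored as a column (x1, x2, x3) of df; if the lines are rows, use apply(df, 1, unique) instead.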
#supposing your data.frame is called 'df'
sapply(df, unique)
#$x1
#[1] 1 2 3 4 6 7 8
#
#$x2
#[1] 2 5 7 8 9 0
#
#$x3
#[1] 6 5 1 2 3 4
or
lapply(df, unique)
#$x1
#[1] 1 2 3 4 6 7 8
#
#$x2
#[1] 2 5 7 8 9 0
#
#$x3
#[1] 6 5 1 2 3 4
# Imagine D is your data.frame object
# Note: rle() only collapses consecutive repeats, so a value that reappears later in a row is kept
apply(D,1, function(x) rle(x)$values)
A=apply(dat,1,unique)
data.frame(t(sapply(A,`length<-`,max(lengths(A)))))
X1 X2 X3 X4 X5 X6 X7
1 1 2 3 4 6 7 8
2 2 5 7 8 9 0 NA
3 6 5 1 2 3 4 NA
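Because the rows end up with different numbers of unique values, a list is a natural container when NA padding is not wanted. As a small sketch (dat is rebuilt here from the question's input, and uniq_rows is a name chosen for illustration), each row can be processed by index, which always returns a list regardless of the lengths:

```r
# The three input rows from the question, stored as rows of a data.frame
dat <- data.frame(rbind(c(1, 1, 2, 3, 4, 1, 6, 7, 8),
                        c(2, 2, 5, 5, 7, 8, 9, 0, 0),
                        c(6, 6, 6, 6, 5, 1, 2, 3, 4)))

# Flatten each row to a plain vector and keep the first occurrence of each value
uniq_rows <- lapply(seq_len(nrow(dat)), function(i) {
  unique(unlist(dat[i, ], use.names = FALSE))
})
uniq_rows[[1]]
# 1 2 3 4 6 7 8
```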
I'm looking to do the following simple task.
Imagine a dataframe with 3 rows and 5 columns:
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 3 4 5 6 7
3 2 3 4 5 6
I want to add another constant row to only the last 4 columns
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 3 4 5 6 7
3 2 3 4 5 6
4 5 5 5 5
How can I accomplish this? Thank you in advance!
Assuming your original dataframe is called df and you have data of length ncol(df)-1 in a vector called data.to.add:
rbind(df, c(NA, data.to.add))
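As a worked instance of that one-liner, the sketch below rebuilds the question's data frame and uses a constant 5 for the last four columns; the name out is chosen for illustration.

```r
df <- data.frame(X1 = c(1, 3, 2), X2 = c(2, 4, 3), X3 = c(3, 5, 4),
                 X4 = c(4, 6, 5), X5 = c(5, 7, 6))

# One value for every column except the first
data.to.add <- rep(5, ncol(df) - 1)

# The leading NA fills X1 so the new row has one value per column
out <- rbind(df, c(NA, data.to.add))
out
```

The new row holds NA in X1 rather than a blank, since a data frame needs a value in every cell.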
Here is a function.
Your input data:
library(data.table)  # provides fread
df <-
fread(" X1 X2 X3 X4 X5
1 1 2 3 4 5
2 3 4 5 6 7
3 2 3 4 5 6")
Example data:
a=df;b=rep(5,4)
Function:
rbind.filling <- function(a, b) {
  n_col <- max(ncol(a), ncol(b))
  # Left-pad whichever input is a plain vector with NA and borrow the other's column names
  if (is.null(dim(a))) {
    a <- matrix(c(rep(NA, n_col - length(a)), a), nrow = 1)
    colnames(a) <- names(b)
  } else {
    b <- matrix(c(rep(NA, n_col - length(b)), b), nrow = 1)
    colnames(b) <- names(a)
  }
  return(rbind(a, b))
}
Call the function:
rbind.filling(a,b)
# V1 V2 V3 V4 V5 V6
# 1: 1 1 2 3 4 5
# 2: 2 3 4 5 6 7
# 3: 3 2 3 4 5 6
# 4: NA NA 5 5 5 5
rbind.filling(b,a)
# V1 V2 V3 V4 V5 V6
# 1: NA NA 5 5 5 5
# 2: 1 1 2 3 4 5
# 3: 2 3 4 5 6 7
# 4: 3 2 3 4 5 6
I have this data.frame:
a <- c(rep("1", 3), rep("2", 3), rep("3",3), rep("4",3), rep("5",3))
b <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
df <-data.frame(a,b)
a b
1 1 1
2 1 2
3 1 3
4 2 4
5 2 5
6 2 6
7 3 7
8 3 8
9 3 9
10 4 10
11 4 11
12 4 12
13 5 13
14 5 14
15 5 15
I want to have something like this:
a <- c(rep("2", 3), rep("3", 3))
b <- c(4,5,6,7,8,9)
dffinal<-data.frame(a,b)
a b
1 2 4
2 2 5
3 2 6
4 3 7
5 3 8
6 3 9
I could use the "subset" function, but it's not working:
sub <- subset(df,c(2,3) == a )
a b
5 2 5
8 3 8
This command only takes one row of "2" and "3" in column "a".
Any help?
You're confusing == with %in%. == compares element by element (recycling c(2,3) along a), while %in% tests membership:
subset(df, a %in% c(2,3))
# a b
# 4 2 4
# 5 2 5
# 6 2 6
# 7 3 7
# 8 3 8
# 9 3 9
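To see why the original call kept only rows 5 and 8, here is a minimal sketch that compares the two masks on the question's data. The names eq_mask and in_mask are chosen for illustration, and suppressWarnings() only hides the length-mismatch warning that == raises here.

```r
a <- c(rep("1", 3), rep("2", 3), rep("3", 3), rep("4", 3), rep("5", 3))
b <- 1:15
df <- data.frame(a, b)

# == recycles c(2, 3) as 2, 3, 2, 3, ... and compares position by position
eq_mask <- suppressWarnings(df$a == c(2, 3))
# %in% asks, for each element of a, whether it occurs anywhere in c(2, 3)
in_mask <- df$a %in% c(2, 3)

which(eq_mask)
# 5 8
which(in_mask)
# 4 5 6 7 8 9
```

Row 5 matches only because the recycled value at an odd position is 2, and row 8 because the value at an even position is 3.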
What about this?
library(dplyr)
df %>% filter(a == 2 | a==3)
a b
1 2 4
2 2 5
3 2 6
4 3 7
5 3 8
6 3 9
We can use data.table. We convert the 'data.frame' to 'data.table' (setDT(df)), and set the 'key' as column 'a', then we subset the rows.
library(data.table)
setDT(df, key= 'a')[c('2','3')]
# a b
#1: 2 4
#2: 2 5
#3: 2 6
#4: 3 7
#5: 3 8
#6: 3 9