R: How to calculate lags for multiple columns by group with data.table

I would like to calculate differences of variables in a data.table, grouped by id. The data are recorded at a sample rate of 1 Hz, and I would like to estimate the first and second derivatives (speed, acceleration). Here is some sample data:
df <- read.table(text='x y id
1 2 1
2 4 1
3 5 1
1 8 2
5 2 2
6 3 2',header=TRUE)
library(data.table)
dt <- data.table(df)
Expected output
# dx dy id
# NA NA 1
# 1 2 1
# 1 1 1
# NA NA 2
# 4 -6 2
# 1 1 2
Here's what I've tried
dx_dt<-dt[, diff:=c(NA,diff(dt[,'x',with=FALSE])),by = id]
Output is
Error in `[.data.frame`(dt, , `:=`(diff, c(NA, diff(dt[, "x", with = FALSE]))), :
unused argument (by = id)
As pointed out by Akrun, the 'speed' terms (dx, dy) can be obtained using either data.table or dplyr. However, I'm unable to understand the calculation well enough to extend it to the acceleration terms. So how do I calculate the second-difference terms?
dt[, c('dx', 'dy') := lapply(.SD, function(x) c(NA, diff(x))),
   by = id]
produces
x y id dx dy
1: 1 2 1 NA NA
2: 2 4 1 1 2
3: 3 5 1 1 1
4: 1 8 2 NA NA
5: 5 2 2 4 -6
6: 6 3 2 1 1
How can I expand this to get a second diff, i.e. the diff of dx and dy?
x y id dx dy dx2 dy2
1: 1 2 1 NA NA NA NA
2: 2 4 1 1 2 NA NA
3: 3 5 1 1 1 0 -1
4: 1 8 2 NA NA NA NA
5: 5 2 2 4 -6 NA NA
6: 6 3 2 1 1 -3 7

You can try
setnames(dt[, lapply(.SD, function(x) c(NA,diff(x))), by=id],
2:3, c('dx', 'dy'))[]
# id dx dy
#1: 1 NA NA
#2: 1 1 2
#3: 1 1 1
#4: 2 NA NA
#5: 2 4 -6
#6: 2 1 1
Another option would be to use dplyr
library(dplyr)
df %>%
  group_by(id) %>%
  mutate_each(funs(c(NA, diff(.)))) %>%
  rename(dx = x, dy = y)
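mutate_each() has since been deprecated; with dplyr >= 1.0 the same idea can be written with across(). A sketch (not part of the original answer) that keeps x and y and adds dx, dy alongside them:

library(dplyr)

df %>%
  group_by(id) %>%
  mutate(across(c(x, y), ~ c(NA, diff(.x)), .names = "d{.col}")) %>%
  ungroup()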
Update
You can apply the same step twice
dt[, c('dx', 'dy'):=lapply(.SD, function(x) c(NA, diff(x))), by=id]
dt[,c('dx2', 'dy2'):= lapply(.SD, function(x) c(NA, diff(x))),
by=id, .SDcols=4:5]
dt
# x y id dx dy dx2 dy2
#1: 1 2 1 NA NA NA NA
#2: 2 4 1 1 2 NA NA
#3: 3 5 1 1 1 0 -1
#4: 1 8 2 NA NA NA NA
#5: 5 2 2 4 -6 NA NA
#6: 6 3 2 1 1 -3 7
Or we can use the shift function from data.table
dt[, paste0("d", c("x", "y")) := .SD - shift(.SD), by = id
][, paste0("d", c("x2", "y2")) := .SD - shift(.SD) , by = id, .SDcols = 4:5 ]
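If you prefer named columns over positional .SDcols (4:5 silently breaks if the column order changes), the same two-pass idea can be written with lapply() and shift(). A minimal self-contained sketch using the sample data from the question:

library(data.table)

dt <- data.table(x  = c(1, 2, 3, 1, 5, 6),
                 y  = c(2, 4, 5, 8, 2, 3),
                 id = c(1, 1, 1, 2, 2, 2))

# first differences (speed)
dt[, c("dx", "dy") := lapply(.SD, function(v) v - shift(v)), by = id, .SDcols = c("x", "y")]
# second differences (acceleration): difference the first differences
dt[, c("dx2", "dy2") := lapply(.SD, function(v) v - shift(v)), by = id, .SDcols = c("dx", "dy")]
dt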

Related

Set values of a column to NA after a given point

I have a dataset like this:
ID NUMBER X
1 5 2
1 3 4
1 6 3
1 2 5
2 7 3
2 3 5
2 9 3
2 4 2
and I'd like to set values of variable X to NA after the variable NUMBER increses (even though after it decreases again) for each ID, and obtaining:
ID NUMBER X
1 5 2
1 3 4
1 6 NA
1 2 NA
2 7 3
2 3 5
2 9 NA
2 4 NA
How can I do it?
Thanks for your help!
Surely not the most elegant solution, but it is quite intuitive:
library(data.table)
setDT(d)
d[, n := ifelse(NUMBER > shift(NUMBER, 1, type = "lag"), 1, 0), by = ID]
d[is.na(n), n := 0]
d[, n := cumsum(n), by=ID]
d[n>0, X := NA ]
d
ID NUMBER X n
1: 1 5 2 0
2: 1 3 4 0
3: 1 6 NA 1
4: 1 2 NA 1
5: 2 7 3 0
6: 2 3 5 0
7: 2 9 NA 1
8: 2 4 NA 1
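The same logic can also be condensed into a single grouped assignment: diff() flags the rows after an increase and cumsum() carries that flag forward (a sketch, assuming d is the data.table built above):

# once NUMBER has increased at least once within an ID, blank out X from that row on
d[, X := replace(X, cumsum(c(FALSE, diff(NUMBER) > 0)) > 0, NA), by = ID]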
You can do this with the dplyr package. If your data frame is called df, then you can use this code:
df %>%
  group_by(ID) %>%
  mutate(X = c(X[1:min(which(diff(NUMBER) > 0))],
               rep(NA, length(X) - min(which(diff(NUMBER) > 0))))) %>%
  as.data.frame()
I first grouped by ID and then found the first increase with diff() and which().

Shifting the last non-NA value by id

I have a data table that looks like this:
DT<-data.table(day=c(1,2,3,4,5,6,7,8),Consumption=c(5,9,10,2,NA,NA,NA,NA),id=c(1,2,3,1,1,2,2,1))
day Consumption id
1: 1 5 1
2: 2 9 2
3: 3 10 3
4: 4 2 1
5: 5 NA 1
6: 6 NA 2
7: 7 NA 2
8: 8 NA 1
I want to create two columns that show the last non-NA consumption value before the observation, and the difference in days between those observations, within the id groups. So far, I have tried this:
DT[, j := day-shift(day, fill = NA,n=1), by = id]
DT[, yj := shift(Consumption, fill = NA,n=1), by = id]
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 1 NA
8: 8 NA 1 3 NA
However, I want the lagged consumption values with n = 1 to come from the rows that have non-NA consumption values. For example, in the 7th row, the yj value is NA because it comes from the 6th row, which has NA consumption; I want it to come from the 2nd row instead. Therefore, I would like to end up with this data table:
day Consumption id j yj
1: 1 5 1 NA NA
2: 2 9 2 NA NA
3: 3 10 3 NA NA
4: 4 2 1 3 5
5: 5 NA 1 1 2
6: 6 NA 2 4 9
7: 7 NA 2 5 9
8: 8 NA 1 4 2
Note: The reason for specifically using the n parameter of the shift function is that I will also need the second-to-last non-NA consumption values in the next step.
Thank You
Here's a data.table solution with an assist from zoo:
library(data.table)
library(zoo)
DT[, `:=`(day_shift = shift(day),
yj = shift(Consumption)),
by = id]
# where the lagged Consumption (yj) is NA, blank out day_shift as well
DT[is.na(yj), day_shift := NA_integer_]
# carry the last non-NA value forward within each id
DT[, `:=`(day_shift = zoo::na.locf(day_shift, na.rm = FALSE),
          yj = zoo::na.locf(yj, na.rm = FALSE)),
   by = id]
# finally calculate j
DT[, j:= day - day_shift]
# you can clean up the ordering or remove columns later
DT
day Consumption id day_shift yj j
1: 1 5 1 NA NA NA
2: 2 9 2 NA NA NA
3: 3 10 3 NA NA NA
4: 4 2 1 1 5 3
5: 5 NA 1 4 2 1
6: 6 NA 2 2 9 4
7: 7 NA 2 2 9 5
8: 8 NA 1 4 2 4
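If you would rather avoid the zoo dependency, data.table's own nafill() (numeric columns only, available since data.table 1.12.4) can stand in for na.locf() in the same recipe. A sketch, not the answerer's original code:

library(data.table)

DT <- data.table(day = c(1, 2, 3, 4, 5, 6, 7, 8),
                 Consumption = c(5, 9, 10, 2, NA, NA, NA, NA),
                 id = c(1, 2, 3, 1, 1, 2, 2, 1))

DT[, `:=`(day_shift = shift(day), yj = shift(Consumption)), by = id]
# blank out day_shift on rows where the lagged Consumption was NA
DT[is.na(yj), day_shift := NA]
# carry the last non-NA value forward within each id
DT[, `:=`(day_shift = nafill(day_shift, type = "locf"),
          yj = nafill(yj, type = "locf")), by = id]
DT[, j := day - day_shift]
DT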

sapply function(x) where x is subsetted argument

So, I want to generate a new vector from the information in two existing numerical ones: one giving the id of the participant, the other indicating the observation number. Each participant has been observed a different number of times.
Now, the new vector should be: 0 when obs_no = 1; 1 when obs_no is the last observation for that id; NA for the cases in between.
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
I figure I could do this separately for every id using code like this:
new_vector <- c(0, rep(NA, times=length(obs_no[id==1])-2), 1)
Or I guess just using max() but it wouldn't make any difference.
But adding each participant manually is really inconvenient since I have a lot of cases. I can't figure out how to make a generic function. I tried to define a function(x) using sapply, but I can't get it to work since x is positioned within the subsetting brackets.
Any advice would be helpful. Thanks.
ave to the rescue:
dat$newvar <- NA
dat$newvar <- with(dat,
ave(newvar, id, FUN=function(x) replace(x, c(length(x),1), c(1,0)) )
)
Or use a bit of duplicated() fun:
dat$newvar <- NA
dat$newvar[!duplicated(dat$id, fromLast=TRUE)] <- 1
dat$newvar[!duplicated(dat$id)] <- 0
Both giving:
# id obs_no new_vector newvar
#1 1 1 0 0
#2 1 2 NA NA
#3 1 3 NA NA
#4 1 4 NA NA
#5 1 5 1 1
#6 2 1 0 0
#7 2 2 1 1
#8 3 1 0 0
#9 3 2 NA NA
#10 3 3 1 1
You can also do this with dplyr
str <- "
id obs_no new_vector
1 1 0
1 2 NA
1 3 NA
1 4 NA
1 5 1
2 1 0
2 2 1
3 1 0
3 2 NA
3 3 1
"
dt <- read.table(textConnection(str), header = T)
library(dplyr)
dt %>% group_by(id) %>%
mutate(newvar = if_else(obs_no==1,0L,if_else(obs_no==max(obs_no),1L,as.integer(NA))))
We can use data.table
library(data.table)
i1 <- setDT(df1)[, .I[seq_len(.N) %in% c(1, .N)], id]$V1
df1[i1, newvar := c(0, 1)]
df1
# id obs_no new_vector newvar
# 1: 1 1 0 0
# 2: 1 2 NA NA
# 3: 1 3 NA NA
# 4: 1 4 NA NA
# 5: 1 5 1 1
# 6: 2 1 0 0
# 7: 2 2 1 1
# 8: 3 1 0 0
# 9: 3 2 NA NA
#10: 3 3 1 1
Use split:
result = lapply(split(obs_no, id), function (x) c(0, rep(NA, length(x) - 2), 1))
This gives you a list of vectors. You can paste them back together like this:
do.call(c, result)
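One caveat (not in the original answer): split() returns the groups in sorted id order, so this concatenation only lines up with the original rows if they are already sorted by id. Base R's unsplit(result, id) puts the pieces back into the original row order.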

Fill missing values with new data R-Python

I have two datasets, x and y:
> x
a index b
1 1 1 5
2 NA 2 6
3 2 3 NA
4 NA 4 9
> y
index a
1 2 100
2 4 101
>
I would like to fill the missing values of x with the values contained in y.
I have tried to use the merge function but the result is not what I want.
> merge(x,y, by = 'index', all=T)
index a.x b a.y
1 1 1 5 NA
2 2 NA 6 100
3 3 2 NA NA
4 4 NA 9 101
In the real problem there are additional limitations:
1. y does not fill all the missing values
2. x and y have more variables in common (not only a and index)
EDIT: a more realistic example
> x
a index b c
1 1 1 5 NA
2 NA 2 6 NA
3 2 3 NA 5
4 NA 4 9 NA
5 NA 5 10 6
> y
index a c
1 2 100 4
2 4 101 NA
>
A solution in either Python or R would be accepted.
I used your merge idea and did the following using dplyr. I am sure there are better ways of doing this task.
index <- 1:5
a <- c(1, NA, 2, NA, NA)
b <- c(5,6,NA,9,10)
c <- c(NA,NA,5,NA,6)
ana <- data.frame(index, a,b,c, stringsAsFactors=F)
index <- c(2,4)
a <- c(100, 101)
c <- c(4, NA)
bob <- data.frame(index, a,c, stringsAsFactors=F)
> ana
index a b c
1 1 1 5 NA
2 2 NA 6 NA
3 3 2 NA 5
4 4 NA 9 NA
5 5 NA 10 6
> bob
index a c
1 2 100 4
2 4 101 NA
library(dplyr)

ana %>%
  merge(bob, by = "index", all = TRUE) %>%
  mutate(a.x = ifelse(is.na(a.x), a.y, a.x)) %>%
  mutate(c.x = ifelse(is.na(c.x), c.y, c.x))
index a.x b c.x a.y c.y
1 1 1 5 NA NA NA
2 2 100 6 4 100 4
3 3 2 NA 5 NA NA
4 4 101 9 NA 101 NA
5 5 NA 10 6 NA NA
I overwrote a.x (ana$a) with a.y (bob$a) using mutate, and did the same for c.x (ana$c). If you remove a.y and c.y at the end, that should be the outcome you expect.
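For instance, appending %>% select(-a.y, -c.y) to the chain drops those helper columns.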
Try:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
m1$b = x$b[match(m1$index, x$index)]
m1$c = x$c[match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
or, if there are many other columns like b and c:
xa = x[,c(1,2)]
m1 = merge(y,xa,all=T)
m1 = m1[!duplicated(m1$index),]
for(nn in names(x)[3:4]) m1[,nn] = x[,nn][match(m1$index, x$index)]
m1
index a b c
1 1 1 5 NA
2 2 100 6 NA
4 3 2 NA 5
5 4 101 9 NA
7 5 NA 10 6
If there are multiple columns to replace, you could try converting from wide to long form as shown in the first two methods and replace in one step
m1 <- merge(x,y, by="index", all=TRUE)
m1L <- reshape(m1, idvar="index", varying=grep("\\.", colnames(m1)), direction="long", sep=".")
row.names(m1L) <- 1:nrow(m1L)
lst1 <- split(m1L, m1L$time)
indx <- is.na(lst1[[1]][,4:5])
lst1[[1]][,4:5][indx] <- lst1[[2]][,4:5][indx]
res <- lst1[[1]][,c(4,1,2,5)]
res
# a index b c
#1 1 1 5 NA
#2 100 2 6 4
#3 2 3 NA 5
#4 101 4 9 NA
#5 NA 5 10 6
Or you could use dplyr with tidyr
library(dplyr)
library(tidyr)
z <- left_join(x, y, by="index") %>%
gather(Var, Val, matches("\\.")) %>%
separate(Var, c("Var1", "Var2"))
indx1 <- which(is.na(z$Val) & z$Var2=="x")
z$Val[indx1] <- z$Val[indx1+nrow(z)/2]
z %>%
spread(Var1, Val) %>%
filter(Var2=="x") %>%
select(-Var2)
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
Or group the columns whose names match before the "." and use sapply to replace the NAs.
indx <- grep("\\.", colnames(m1),value=TRUE)
res <- cbind(m1[!names(m1) %in% indx],
sapply(split(indx, gsub("\\..*", "", indx)), function(x) {
x1 <- m1[x]
indx1 <- is.na(x1[,1])
x1[,1][indx1] <- x1[,2][indx1]
x1[,1]} ))
res
# index b a c
#1 1 5 1 NA
#2 2 6 100 4
#3 3 NA 2 5
#4 4 9 101 NA
#5 5 10 NA 6
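As a more recent alternative (not in the original answers), dplyr::coalesce() expresses "take x's value unless it is NA, then fall back to y's" directly. A sketch using the x and y from the edited example:

library(dplyr)

x %>%
  left_join(y, by = "index", suffix = c(".x", ".y")) %>%
  mutate(a = coalesce(a.x, a.y),
         c = coalesce(c.x, c.y)) %>%
  select(index, a, b, c)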

How to calculate the mean and median of different columns under some conditions using data.table aggregation in R

I have four vectors (columns)
x y z t
1 1 1 10
1 1 1 15
1 4 1 14
2 3 1 15
2 2 1 17
2 1 2 19
2 4 2 18
2 4 2 NA
2 2 2 45
3 3 2 NA
3 1 3 59
4 3 3 23
4 4 3 45
4 4 4 74
5 1 4 86
I know how to calculate the mean and median of vector t for each combination of values of x, y, and z.
The example is:
bar <- data.table(expand.grid(x=unique(data[x %in% c(1,2,3,4,5),x]),
y=unique(data[y %in% c(1,2,3,4),y]),
z=unique(data[z %in% c(1,2,3,4),z])))
foo <- data[z %in% c(1,2,3,4),list(
mean.t=mean(t,na.rm=T),
median.t=median(t,na.rm=T))
,by=list(x,y,z)]
merge(bar[,list(x,y,z)],foo,by=c("x","y","z"),all.x=T)
The result is:
x y z mean.t median.t
1: 1 1 1 12.5 12.5
2: 1 1 2 NA NA
3: 1 1 3 NA NA
4: 1 1 4 NA NA
5: 1 2 1 NA NA
........................
79: 5 4 3 NA NA
80: 5 4 4 NA NA
But now my question is: how do I do the same calculations for x, y, z, and t, but with z treated not as the numbers 1 to 4 but as groups:
if 0 < z <= 2, group I,
if 2 < z <= 3, group II, and
if 3 < z <= 4, group III.
So the output should be in a format like:
x y z mean.t median.t
1: 1 1 I
2: 1 1 II
3: 1 1 III
4: 1 2 I
5: 1 2 II
6: 1 2 III
7: 1 3 I
8: 1 3 II
9: 1 3 III
10: 1 4 I
..........
Define a new column, zGroup, to group by.
(The data in this example are a little different from yours.)
#create some data
dt<-data.table(x=rep(c(1,2),each=4),
y=rep(c(1,2),each=2,times=2),
z=rep(c(1,2,3,4),times=2),t=1:8)
#add a zGroup column
dt[0<z & z<=2, zGroup:=1]
dt[2<z & z<=3, zGroup:=2]
dt[3<z & z<=4, zGroup:=3]
#group by unique combinations of x, y, zGroup taking mean and median of t
dt[,list(mean.t=mean(t), median.t=as.double(median(t))), by=list(x,y,zGroup)]
Note that this will error without coercing the median to a double: for an integer column, median() returns an integer for some groups and a double for others (when the middle value is the average of two), and data.table requires a result column to have a consistent type across groups. See this post for details.
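An equivalent way to build zGroup in one step (a sketch on the same toy data, not part of the original answer) is cut(), whose right-closed intervals match the question's boundaries:

library(data.table)

dt <- data.table(x = rep(c(1, 2), each = 4),
                 y = rep(c(1, 2), each = 2, times = 2),
                 z = rep(c(1, 2, 3, 4), times = 2),
                 t = 1:8)

# (0,2] -> I, (2,3] -> II, (3,4] -> III
dt[, zGroup := cut(z, breaks = c(0, 2, 3, 4), labels = c("I", "II", "III"))]
dt[, .(mean.t = mean(t, na.rm = TRUE),
       median.t = as.double(median(t, na.rm = TRUE))),
   by = .(x, y, zGroup)]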
