R: diff() between columns in data table

I apologize for improper titling.
The dt and d.dt are respectively the input and desired output.
library(data.table)
set.seed(10)
dt = data.frame(x=sample(10, 3), y=sample(10,3))
dt = as.data.table(dt)
# > dt
# x y
# 1: 6 7
# 2: 3 1
# 3: 4 2
d.dt = copy(dt)[, z := c(-4, 3, NA)] # copy() so that := does not also modify dt by reference
# > d.dt
# x y z
# 1: 6 7 -4
# 2: 3 1 3
# 3: 4 2 NA
The expected d.dt[, z] is computed by subtracting the current row of column y from the next row of column x.

Based on the expected output, it seems we are subtracting the current row of 'y' from the next row of 'x'. To get the succeeding (next) row, we can use shift from data.table with the argument type='lead'.
dt[, z:= shift(x, type='lead')-y]
dt
# x y z
#1: 6 7 -4
#2: 3 1 3
#3: 4 2 NA
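As a small check of the lead-and-subtract idea, the sketch below uses the values from the printed table directly (typed in by hand rather than drawn with sample(), so no seed is involved) and compares shift() with a base R equivalent. The name z_base is only illustrative.

```r
library(data.table)

# Values copied from the printed table above, not re-sampled
dt <- data.table(x = c(6, 3, 4), y = c(7, 1, 2))

# data.table: move x up by one row (lead), then subtract the current y
dt[, z := shift(x, type = "lead") - y]

# Base R equivalent: drop the first x and pad the end with NA
z_base <- c(dt$x[-1], NA) - dt$y

print(dt)
```

The last row has no successor, so both versions leave NA there.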

Related

Last observation of the previous group

I would like to know, if I have data that I can group by a variable, how can I get the last observation of the previous group?
I have the following data:
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))
I would like to create a new data.table that has the group ID and the difference between the last observation of the group and the last observation of the previous group. So from the above I'd get:
a c
1: 1 NA
2: 2 9
3: 3 1
4: 4 -8
5: 5 5
Thanks!
We group by 'a', take the last element of 'b' as 'c', then lag 'c' with shift so that each row holds the last value of the previous group:
dt[, .(c = last(b)), a][, c:= shift(c)][]
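To turn the lagged value into the difference the question asks for, the lag can be subtracted from the group's own last value. The sketch below uses a small hand-made table (hypothetical values, chosen so the result is easy to check) and assumes the intended sign is "current group minus previous group"; swap the operands if the opposite sign is wanted.

```r
library(data.table)

# Hypothetical fixed data: the last b per group is 5, 9 and 2
dt <- data.table(a = c(1, 1, 2, 2, 2, 3), b = c(4, 5, 7, 8, 9, 2))

# One row per group holding that group's last b (b[.N] is the last element)
res <- dt[, .(last_b = b[.N]), by = a]

# Assumed sign: this group's last value minus the previous group's last value
res[, c := last_b - shift(last_b)]

print(res)
```

The first group has no predecessor, so its difference is NA.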
Here is a way:
dt[, c := b * (1:.N == .N), by = a] ## get last row within the group
dt <- dt[b == c] ## filter data.table to get rows of interest
dt[, c := shift(c, type = "lag") - c][] ## getting difference using shift with lag argument
# a b c
#1: 1 11 NA
#2: 2 10 NA
#3: 3 18 9
#4: 4 19 -7
#5: 5 12 -8
data
set.seed(1)
dt <- data.table(a=c(1,1,1,2,2,2,2,2,3,3,3,3,3,3,4,4,5,5,5,5,5), b=sample.int(21))

Compare Lists in datatable

I have a data table(data) which looks like the following.
rn peoplecount
1 0,2,0,1
2 1,1,0,0
3 0,1,0,5
4 5,3,0,2
5 2,2,0,1
6 1,2,0,3
7 0,1,0,0
8 0,2,0,8
9 8,2,0,0
10 0,1,0,0
My goal is to find all records where the 1st element of the present row does not match the 4th element of the previous row. In this example, the 7th row meets the criterion. How can I get a list of all such records?
My attempt so far.
data[, previous_peoplecount:= c(NA, peoplecount[shift(seq_along(peoplecount), fill = 0)])]
This gives a new table as follows:
rn peoplecount previous_peoplecount
1 0,2,0,1 NA
2 1,1,0,0 0,2,0,1
3 0,1,0,5 1,1,0,0
4 5,3,0,2 0,1,0,5
5 2,2,0,1 5,3,0,2
6 1,2,0,3 2,2,0,1
7 0,1,0,0 1,2,0,3
8 0,2,0,8 0,1,0,0
9 8,2,0,0 0,2,0,8
10 0,1,0,0 8,2,0,0
Now I have to fetch all records where the 1st element of peoplecount is not equal to the 4th element of previous_peoplecount. I am stuck at this part. Any suggestions?
Edit: peoplecount is a list of numerics.
You can try something along the lines of removing all but first value and all but last value, and comparing, i.e.
library(data.table)
setDT(dt)[, first_pos := sub(',.*', '', peoplecount)][,
last_pos_shifted := sub('.*,', '', shift(peoplecount))][
first_pos != last_pos_shifted,]
which gives,
rn peoplecount first_pos last_pos_shifted
1: 7 0,1,0,0 0 3
I would convert to long format and then select the elements of interest:
dt <- data.table(rn = 1:3, x = lapply(1:3, function(x) x:(x+3)))
dt$x[[2]] <- c(4, 1, 1, 1)
dt
# rn x
# 1: 1 1,2,3,4
# 2: 2 4,1,1,1
# 3: 3 3,4,5,6
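# convert to long format
dt2 <- dt[, .(rn = rep(rn, each = 4), x = unlist(x))]
# position of each element within its row (recent data.table versions do not recycle 1:4)
dt2[, id := seq_len(.N), by = rn]
# first element of a row equal to the element just before it, i.e. the 4th element of the previous row
dtSelected <- dt2[x == shift(x) & id == 1]
dtSelected
# rn x id
# 1: 2 4 1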
dt[dtSelected$rn]
# rn x
# 1: 2 4,1,1,1
I was not satisfied with the answers and came up with my own solution as follows:
h<-sapply(data$peoplecount,function(x){x[1]})
t<-sapply(data$peoplecount,function(x){x[4]})
indices<-which(head(t,-1)!=tail(h,-1))
Thanks to @Sotos and @minem for pushing me in the right direction.
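For reference, here is a self-contained sketch of the same head/tail idea on the table from the question, with peoplecount rebuilt as a list column of numeric vectors. The names txt, first, fourth and mismatch are illustrative only. Note that which(head(t, -1) != tail(h, -1)) returns the position of the earlier row of each mismatching pair, so the present row is that index plus one; the version below compares against the lagged value instead and therefore returns the present row directly.

```r
library(data.table)

# Rebuild the question's table; peoplecount is a list column of numeric vectors
txt <- c("0,2,0,1", "1,1,0,0", "0,1,0,5", "5,3,0,2", "2,2,0,1",
         "1,2,0,3", "0,1,0,0", "0,2,0,8", "8,2,0,0", "0,1,0,0")
data <- data.table(rn = 1:10,
                   peoplecount = lapply(strsplit(txt, ","), as.numeric))

# First and fourth element of every row
first  <- vapply(data$peoplecount, function(v) v[1], numeric(1))
fourth <- vapply(data$peoplecount, function(v) v[4], numeric(1))

# Compare each first element with the previous row's fourth element;
# row 1 has no predecessor, so its comparison is NA and which() skips it
mismatch <- which(first != shift(fourth))

data[mismatch]
```

On this data, the only mismatch is row 7.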

create column in datatable depending on its values

I have a single column in a data table:
library(data.table)
DT <- data.table(con=c(1:5))
The result I want is a data table with a new column x calculated as follows: the first value of x is the first value of con (here: 1); the second value is the second value of con multiplied by the first value of x; the third value is the third value of con multiplied by the second value of x; and so on. Result:
DT <- data.table(con=c(1:5), x = c(1,2,6,24,120))
I tried using shift, but it did not help. Below are some lines of my code:
DT <- data.table(con=c(1:5))
DT[, x := shift(con,1, type = "lead")]
DT[, x := shift(x, 1)]
DT[, x := con * x]
You are looking for cumprod
DT[,x:=cumprod(con)]
DT
con x
1: 1 1
2: 2 2
3: 3 6
4: 4 24
5: 5 120
We can use the accumulate function from the purrr package.
library(data.table)
library(purrr)
DT <- data.table(con=c(1:5))
DT[, x := accumulate(con, `*`)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
Or the Reduce function from base R.
DT <- data.table(con=c(1:5))
DT[, x:= Reduce(`*`, con, accumulate = TRUE)][]
# con x
# 1: 1 1
# 2: 2 2
# 3: 3 6
# 4: 4 24
# 5: 5 120
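To see why a cumulative product is the right tool, the recurrence from the question can be written out as an explicit loop and compared with cumprod. This is only an illustrative sketch in base R; the loop is not needed in practice.

```r
con <- 1:5

# Explicit form of the recurrence: x[1] = con[1], x[i] = con[i] * x[i - 1]
x <- numeric(length(con))
x[1] <- con[1]
for (i in seq_along(con)[-1]) {
  x[i] <- con[i] * x[i - 1]
}

x             # 1 2 6 24 120
cumprod(con)  # same values
```

Each x[i] depends on the freshly computed x[i - 1], which is why a single shift of an existing column cannot reproduce it.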

R Given a list of same dimension data tables, produce a summary of the means of each cell

I'm finding it hard to put what I want into words so I will try to run through an example to explain it. Let's say I've repeated an experiment twice and have two tables:
[df1] [df2]
X Y X Y
2 3 4 1
5 2 2 4
These tables are stored in a list (where the list can contain more than two elements if necessary), and what I want to do is create an average of each cell in the tables across the list (or for a generalised version, apply any function I choose to the cells i.e. mad, sd, etc)
[df1] [df2] [dfMeans]
X Y X Y X Y
2 3 4 1 mean(2,4) mean(3,1)
5 2 2 4 mean(5,2) mean(2,4)
I have a code solution to my problem, but since this is in R there is most likely a cleaner way to do things:
df1 <- data.frame(X=c(2,3,4),Y=c(3,2,1))
df2 <- data.frame(X=c(5,1,3),Y=c(4,1,4))
df3 <- data.frame(X=c(2,7,4),Y=c(1,7,6))
dfList <- list(df1,df2,df3)
dfMeans <- data.frame(MeanX=c(NA,NA,NA),MeanY=c(NA,NA,NA))
for (rowIndex in 1:nrow(df1)) {
for (colIndex in 1:ncol(df1)) {
valuesAtCell <- c()
for (tableIndex in 1:length(dfList)) {
valuesAtCell <- c(valuesAtCell, dfList[[tableIndex]][rowIndex,colIndex])
}
dfMeans[rowIndex, colIndex] <- mean(valuesAtCell)
}
}
print(dfMeans)
Here is a data.table solution where the mean is applied row-wise across the data frames:
library(data.table)
dtList <- rbindlist(dfList, use.names = TRUE, idcol = TRUE)
dtList
.id X Y
1: 1 2 3
2: 1 3 2
3: 1 4 1
4: 2 5 4
5: 2 1 1
6: 2 3 4
7: 3 2 1
8: 3 7 7
9: 3 4 6
dtList[, rn := 1:.N, by = .id][][, .(X = mean(X), Y = mean(Y)), by = rn]
rn X Y
1: 1 3.000000 2.666667
2: 2 3.666667 3.333333
3: 3 3.666667 3.666667
You can replace the mean with another aggregation function, e.g., median. The .id column records which of the original data frames each row came from.
Edit
The solution can be extended to an arbitrary number of columns (provided column names and column order are identical in all data frames):
cn <- colnames(df1)
cn
[1] "X" "Y"
dtList[, rn := 1:.N, by = .id][, lapply(.SD, mean), by = rn, .SDcols = cn][, rn := NULL][]
X Y
1: 3.000000 2.666667
2: 3.666667 3.333333
3: 3.666667 3.666667
The column names are taken from one of the original data frames, which adds to the flexibility of the solution. [, rn := NULL] removes the row numbers from the result, and [] ensures the result is printed.
You could simply sum all the data frames in your list using Reduce() and divide by the length of dfList, which is equal to the number of data frames it contains.
Reduce(`+`, dfList) / length(dfList)
# X Y
#1 3.000000 2.666667
#2 3.666667 3.333333
#3 3.666667 3.666667
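The Reduce approach is specific to the mean. For the generalised version mentioned in the question (median, mad, sd and so on), one possible sketch in base R stacks the tables into a 3-dimensional array and applies the function over each cell. It assumes all data frames are numeric with identical dimensions and column order; the names arr, cellMeans and cellMedians are illustrative.

```r
df1 <- data.frame(X = c(2, 3, 4), Y = c(3, 2, 1))
df2 <- data.frame(X = c(5, 1, 3), Y = c(4, 1, 4))
df3 <- data.frame(X = c(2, 7, 4), Y = c(1, 7, 6))
dfList <- list(df1, df2, df3)

# Stack the tables into a rows x columns x tables array
arr <- array(unlist(lapply(dfList, as.matrix)),
             dim = c(nrow(df1), ncol(df1), length(dfList)),
             dimnames = list(NULL, colnames(df1), NULL))

# Apply any summary function to each cell across the third dimension
cellMeans   <- apply(arr, c(1, 2), mean)
cellMedians <- apply(arr, c(1, 2), median)

cellMeans
cellMedians
```

Swapping mean for sd or mad in the apply() call gives the corresponding per-cell summary.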

What does ".N" mean in data.table?

I have a data.table dt:
library(data.table)
dt = data.table(a=LETTERS[c(1,1:3)],b=4:7)
a b
1: A 4
2: A 5
3: B 6
4: C 7
The result of dt[, .N, by=a] is
a N
1: A 2
2: B 1
3: C 1
I know that by=a or by="a" means grouping by column a, and that the N column counts how many times each value of a occurs. However, I never call nrow(), yet I still get the result. Is .N more than just a column name? I can't find the documentation with ??".N" in R. I tried .K, but it doesn't work. What does .N mean?
Think of .N as a variable for the number of instances. For example:
dt <- data.table(a = LETTERS[c(1,1:3)], b = 4:7)
dt[.N] # returns the last row
# a b
# 1: C 7
Your example returns a new variable with the number of rows per case:
dt[, new_var := .N, by = a]
dt
# a b new_var
# 1: A 4 2 # 2 'A's
# 2: A 5 2
# 3: B 6 1 # 1 'B'
# 4: C 7 1 # 1 'C'
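The short sketch below collects the common places where .N shows up, using the same table as above. The variable names (n_total, last_row, counts, idx) are illustrative.

```r
library(data.table)

dt <- data.table(a = LETTERS[c(1, 1:3)], b = 4:7)

n_total  <- dt[, .N]               # in j without by: total number of rows
last_row <- dt[.N]                 # in i: .N is the index of the last row
counts   <- dt[, .N, by = a]       # in j with by: number of rows per group
dt[, idx := seq_len(.N), by = a]   # running index within each group

print(counts)
print(dt)
```

When .N is returned on its own in j, the resulting column is named N, which is why the output of dt[, .N, by = a] has a column called N.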
For a list of all special symbols of data.table, see also https://www.rdocumentation.org/packages/data.table/versions/1.10.0/topics/special-symbols
