How to iteratively sum the diagonals of an incidence table in R

I have a data frame of incident cases of a disease, by year and age, which looks like this (it is much larger than this example)
88 89 90 91
22 1 2 5 14
23 1 6 9 15
24 2 5 12 11
25 3 3 7 20
What I would like to do is iteratively sum the diagonals, to get this result
88 89 90 91
22 1 2 5 14
23 1 7 11 20
24 2 6 19 22
25 3 5 13 39
Or, put another way, the original dataset:
Y1 Y2 Y3 Y4
22 A1 B1 C1 D1
23 A2 B2 C2 D2
24 A3 B3 C3 D3
25 A4 B4 C4 D4
Final dataset:
Y1 Y2 Y3 Y4
22 A1 B1 C1 D1
23 A2 A1+B2 B1+C2 C1+D2
24 A3 A2+B3 A1+B2+C3 B1+C2+D3
25 A4 A3+B4 A2+B3+C4 A1+B2+C3+D4
Is there any way to do this in R?
I have seen this question How to sum over diagonals of data frame, but that question only asks for the total sum of each diagonal, whereas I want the iterative (cumulative) sum.
Thanks.

Use ave, noting that row(m) - col(m) is constant on diagonals:
ave(m, row(m) - col(m), FUN = cumsum)
## 88 89 90 91
## 22 1 2 5 14
## 23 1 7 11 20
## 24 2 6 19 22
## 25 3 5 13 39
It is assumed that m is a matrix as in the Note below. If you have a data frame, convert it to a matrix first (a short sketch follows the Note).
Note
The input matrix m in reproducible form is:
Lines <- " 88 89 90 91
22 1 2 5 14
23 1 6 9 15
24 2 5 12 11
25 3 3 7 20"
m <- as.matrix(read.table(text = Lines, check.names = FALSE))
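If the data starts out as a data frame rather than a matrix, a minimal sketch of the conversion; df here is a hypothetical data frame with the same layout as m:
m <- as.matrix(df)   # df is assumed to have the ages as row names and the years as columns
ave(m, row(m) - col(m), FUN = cumsum)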

Related

Combination of all pairs of rows using R

Here is my dataset:
data <- read.table(header = TRUE, text = "
group index group_index x y z
a 1 a1 12 13 14
a 2 a2 15 20 22
b 1 b1 24 17 28
b 2 b2 12 19 30
b 3 b3 31 32 33 ")
For each case in group "a" and each case in group "b", I want to combine their x, y, z values into one row, so the data matrix or data frame I want will look like:
[,1] [,2] [,3] [,4] [,5] [,6] [,7]
[1,] a1_b1 12 13 14 24 17 28 # x,y,z for a1, followed by x,y,z for b1
[2,] a1_b2 12 13 14 12 19 30 # x,y,z for a1, followed by x,y,z for b2
[3,] a1_b3 12 13 14 31 32 33
[4,] a2_b1 15 20 22 24 17 28 # x,y,z for a2, followed by x,y,z for b1
[5,] a2_b2 15 20 22 12 19 30
[6,] a2_b3 15 20 22 31 32 33
I'm wondering how to achieve this. Thanks so much!
We can split the data by group and take a Cartesian product using merge:
list_df <- split(data[c("x", "y", "z")], data$group)
out <- merge(list_df[[1]], list_df[[2]], by = NULL)
out[do.call(order, out), ]
# x.x y.x z.x x.y y.y z.y
#3 12 13 14 12 19 30
#1 12 13 14 24 17 28
#5 12 13 14 31 32 33
#4 15 20 22 12 19 30
#2 15 20 22 24 17 28
#6 15 20 22 31 32 33
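If you also want the pair labels shown in the question (a1_b1, ...), here is a hedged sketch building on the same merge idea; list_lab, out2 and pair are just names introduced for illustration:
list_lab <- split(data[c("group_index", "x", "y", "z")], data$group)
out2 <- merge(list_lab[["a"]], list_lab[["b"]], by = NULL, suffixes = c("_a", "_b"))
out2$pair <- paste(out2$group_index_a, out2$group_index_b, sep = "_")   # build the a1_b1 style labels
out2[order(out2$pair), c("pair", "x_a", "y_a", "z_a", "x_b", "y_b", "z_b")]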
You could also do a join on non-matching group values (< instead of != to avoid repeating pairs)
library(data.table)
setDT(data)
data[data, on = .(group < group),
     .(g = paste0(group_index, '_', i.group_index),
       x, y, z, i.x, i.y, i.z),
     nomatch = NULL]
# g x y z i.x i.y i.z
# 1: a1_b1 12 13 14 24 17 28
# 2: a2_b1 15 20 22 24 17 28
# 3: a1_b2 12 13 14 12 19 30
# 4: a2_b2 15 20 22 12 19 30
# 5: a1_b3 12 13 14 31 32 33
# 6: a2_b3 15 20 22 31 32 33
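A note on the output: the unprefixed columns (group_index, x, y, z) come from the data.table outside the brackets, while the i.-prefixed columns come from the copy passed inside the brackets; the non-equi condition group < group keeps only pairs where the outer row's group sorts before the inner row's, so each a/b pair appears exactly once.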
A simple solution using dplyr:
library(tidyverse)
dcross <- left_join(data, data, by = character(), suffix = c("1", "2")) |>
  filter(group1 < group2)   # '<' rather than '!=' so each a/b pair appears only once, matching the output below
# index1 group_index1 x1 y1 index2 group_index2 x2 y2
# 1 1 a1 12 13 1 b1 24 17
# 2 1 a1 12 13 2 b2 12 19
# 3 1 a1 12 13 3 b3 31 32
# 4 2 a2 15 20 1 b1 24 17
# 5 2 a2 15 20 2 b2 12 19
# 6 2 a2 15 20 3 b3 31 32
And to get the described matrix from the data frame:
dcross |>
  select(matches("^[xyz]\\d")) |>
  as.matrix()
# x1 y1 z1 x2 y2 z2
# [1,] 12 13 14 24 17 28
# [2,] 12 13 14 12 19 30
# [3,] 12 13 14 31 32 33
# [4,] 15 20 22 24 17 28
# [5,] 15 20 22 12 19 30
# [6,] 15 20 22 31 32 33
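As an aside (not part of the original answer): in dplyr 1.1.0 and later, by = character() is deprecated in favour of cross_join(); a minimal sketch under that assumption:
library(dplyr)
dcross <- cross_join(data, data, suffix = c("1", "2")) |>   # cross_join() requires dplyr >= 1.1.0
  filter(group1 < group2)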

How to create a column which uses its own lagged value using dplyr

Suppose I have the following data frame
c1<- c(1:10)
c2<- c(11:20)
df<- data.frame(c1,c2)
c1 c2
1 11
2 12
3 13
4 14
5 15
6 16
7 17
8 18
9 19
10 20
I would like to add a column c3 which is computed as c3(-1) + c2 - c1, i.e. the previous value of c3 plus c2 - c1. For instance,
for the example above the expected result would be:
c1 c2 c3
1 11 0
2 12 10
3 13 20
4 14 30
5 15 40
6 16 50
7 17 60
8 18 70
9 19 80
10 20 90
Is it possible to perform this operation using dplyr ? I have tried several approaches without success. Any suggestion will be much appreciated.
This is a good use for cumsum (cumulative summation):
c3 = lag(cumsum(c2 - c1), default = 0)
Don't think of c3 as c3(-1) + c2 - c1; think of it as c3(n) = sum over i = 1, ..., n - 1 of (c2(i) - c1(i)).
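Spelled out as a complete dplyr pipeline (a minimal sketch using the df defined in the question):
library(dplyr)
df %>%
  mutate(c3 = lag(cumsum(c2 - c1), default = 0))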
The following creates column c3, assuming the first entry is always 0 since there is no preceding element.
df$c3 <- df$c2 - df$c1
df[1,"c3"] <- 0
df$c3 <- cumsum(df$c3)
output
> df
c1 c2 c3
1 1 11 0
2 2 12 10
3 3 13 20
4 4 14 30
5 5 15 40
6 6 16 50
7 7 17 60
8 8 18 70
9 9 19 80
10 10 20 90

Find duplicated rows of a data frame in R [duplicate]

I have the following data:
x1 x2 x3 x4
34 14 45 53
2 8 18 17
34 14 45 20
19 78 21 48
2 8 18 5
In rows 1 and 3, and in rows 2 and 5, the values in columns x1, x2, x3 are equal. How can I output only those 4 rows with the matching values? The output should be in the following format:
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
Please ask me questions if something is unclear.
ADDITIONAL QUESTION: in the output
x1 x2 x3 x4
34 14 45 53
34 14 45 20
2 8 18 17
2 8 18 5
find the sum of the values in the last column:
x1 x2 x3 x4
34 14 45 73
2 8 18 22
You can do this with duplicated, which checks for duplicated rows when passed a matrix or data frame. Since you're only checking the first three columns, you should pass dat[,-4] to the function.
dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast=TRUE),]
# x1 x2 x3 x4
# 1 34 14 45 53
# 2 2 8 18 17
# 3 34 14 45 20
# 5 2 8 18 5
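For the ADDITIONAL QUESTION (summing the last column within each duplicated group), here is a minimal base-R sketch building on the same subset; dups is just a name introduced here:
dups <- dat[duplicated(dat[,-4]) | duplicated(dat[,-4], fromLast = TRUE), ]
aggregate(x4 ~ x1 + x2 + x3, data = dups, FUN = sum)   # yields the 22 and 73 totals requested above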
An alternative using ave:
dat[ave(dat[,1], dat[-4], FUN=length) > 1,]
# x1 x2 x3 x4
#1 34 14 45 53
#2 2 8 18 17
#3 34 14 45 20
#5 2 8 18 5
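To see what the indexing condition does, ave() here returns, for every row, the size of that row's (x1, x2, x3) group (assuming dat is the data frame from the question):
ave(dat[,1], dat[-4], FUN = length)
# [1] 2 2 2 1 2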
Learned this one the other day. You won't need to re-order the output, since split() already groups the matching rows together.
s <- split(dat, do.call(paste, dat[-4]))
Reduce(rbind, Filter(function(x) nrow(x) > 1, s))
# x1 x2 x3 x4
# 2 2 8 18 17
# 5 2 8 18 5
# 1 34 14 45 53
# 3 34 14 45 20
There is another way to solve both questions using two packages.
library(DescTools)
library(dplyr)
dat[AllDuplicated(dat[1:3]), ] %>%   # this line finds the duplicates
  group_by(x1, x2) %>%               # the following lines sum up x4
  mutate(x4 = sum(x4)) %>%
  unique()
# Source: local data frame [2 x 4]
# Groups: x1, x2
#
# x1 x2 x3 x4
# 1 34 14 45 73
# 2 2 8 18 22
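For reference, AllDuplicated() from DescTools flags every row that occurs more than once, including the first occurrence, i.e. the same rows as duplicated(x) | duplicated(x, fromLast = TRUE), which is why its result can be fed straight into the grouped sum.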
You can also use the table command:
> d1 = ddf[ddf$x1 %in% ddf$x1[which(table(ddf$x1)>1)],]
> d2 = ddf[ddf$x2 %in% ddf$x2[which(table(ddf$x2)>1)],]
> rr = rbind(d1, d2)
> (rr2 = rr[!duplicated(rbind(d1, d2)),])
x1 x2 x3 x4
1 34 14 45 53
3 34 14 45 20
2 2 8 18 17
5 2 8 18 5
For the sum in the last column:
> library(data.table)
> rrt = data.table(rr2)
> rrt[,x4:=sum(x4),by=x1]
> rrt[rrt[,!duplicated(x1),]]
x1 x2 x3 x4
1: 34 14 45 73
2: 2 8 18 22
The first part is similar to the above; let z be your data.frame:
library(DescTools)
(zz <- Sort(z[AllDuplicated(z[, -4]), ], decreasing=TRUE) )
# now aggregate
aggregate(zz[, 4], zz[, -4], FUN=sum)
# use Sort again, if needed...

Can't correctly read the value type of data frame elements

I have a data frame, SSIM_BEST:
X1 X2 X3 X4 X5
1 1 36 0.939323 B4 ON
2 1 35 0.943645 B2 ON
3 1 34 0.948516 B2 ON
4 1 33 0.952599 ZL ON
5 1 32 0.956492 ZL ON
6 1 31 0.960432 ZL ON
7 1 30 0.963957 ZL ON
8 1 29 0.96664 ZL ON
9 1 28 0.969612 ZL ON
10 1 27 0.97234 ZL ON
11 1 26 0.97478 ZL ON
12 1 25 0.977332 ZL ON
13 1 24 0.979606 ZL ON
14 1 23 0.981423 ZL ON
15 1 22 0.983776 ZL ON
I have a for loop to read some values from the X3 column, like:
SSIM = c()
for (j in seq(1, dim(SSIM_BEST)[1], by = 2)) {
  SSIM = c(SSIM, SSIM_BEST$X3[[j]])
}
Instead of getting values like 0.939323, 0.948516, ... I get SSIM = 20 27 33 39 44 52 56 61, and I don't know what is going on.
If I use print(SSIM_BEST$X3[[j]]) in the for loop I get something like:
[1] 0.939323
72 Levels: 0.894559 0.899583 0.901154 0.907706 0.914609 0.914673 0.91996 0.920569 0.922076 0.925761 0.925897 0.926495 0.928728 0.931108 ... 0.992964
P.S. SSIM_BEST contains more than 15 rows. I show 15 here for example purposes.
Can you help me please?
We can create a TRUE/FALSE vector to subset.
# data
SSIM_BEST <- read.table(text ="
X1 X2 X3 X4 X5
1 1 36 0.939323 B4 ON
2 1 35 0.943645 B2 ON
3 1 34 0.948516 B2 ON
4 1 33 0.952599 ZL ON
5 1 32 0.956492 ZL ON
6 1 31 0.960432 ZL ON
7 1 30 0.963957 ZL ON
8 1 29 0.96664 ZL ON
9 1 28 0.969612 ZL ON
10 1 27 0.97234 ZL ON
11 1 26 0.97478 ZL ON
12 1 25 0.977332 ZL ON
13 1 24 0.979606 ZL ON
14 1 23 0.981423 ZL ON
15 1 22 0.983776 ZL ON", header = TRUE)
# get odd rows
SSIM_BEST[c(TRUE, FALSE), "X3"]
# more generic solution
mySkip = 2
SSIM_BEST[seq(nrow(SSIM_BEST)) %% mySkip == 1, "X3"]
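The c(TRUE, FALSE) index works because logical row indices are recycled, so rows 1, 3, 5, ... are kept; the same recycling on a plain vector:
(1:6)[c(TRUE, FALSE)]
# [1] 1 3 5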
I think it's because SSIM_BEST$X3 is a factor. I'm willing to bet the values you get from the for loop are the integer codes of the factor levels.
I have a couple of options that should both work.
SSIM <- c()
# convert the factor to its printed values, not its level codes
SSIM_BEST$X3 <- as.numeric(as.character(SSIM_BEST$X3))
for (j in seq(1, dim(SSIM_BEST)[1], by = 2)) {
  SSIM <- c(SSIM, SSIM_BEST$X3[[j]])
}
Or
SSIM <- c()
for (j in seq(1, dim(SSIM_BEST)[1], by = 2)) {
  SSIM <- c(SSIM, as.numeric(as.character(SSIM_BEST$X3[[j]])))
}
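The as.character() step matters: calling as.numeric() directly on a factor returns the underlying integer level codes (which is exactly the 20 27 33 ... output described in the question), not the printed values. A small illustration:
f <- factor(c("0.939323", "0.948516"))
as.numeric(f)                 # 1 2  (level codes)
as.numeric(as.character(f))   # 0.939323 0.948516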
As Frank said, for loops are best avoided here. I wrote a simple function that does what you want without a for loop.
getDat <- function(data, by = 2, start = 1) {
  # keep every by-th element, beginning at position start
  v <- (seq_along(data) %% by == 1)
  if (start > 1) {
    v <- c(v, rep(FALSE, start - 1))
    v <- data.table::shift(v, start - 1)  # lag the logical index; shift() is from data.table
    v[is.na(v)] <- FALSE                  # replace the leading NAs introduced by shift()
    v <- v[seq_len(length(v) - (start - 1))]
  }
  data[v]
}
This also allows you to specify where to start in the vector.
x <- 1:50
getDat(x,2)
[1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 39 41 43 45 47 49
getDat(x,2,2)
[1] 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
getDat(x,3,10)
[1] 10 13 16 19 22 25 28 31 34 37 40 43 46 49
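For this kind of every-nth-element selection, base R's seq() gives the same result without a helper function (a sketch, not part of the original answer):
x <- 1:50
x[seq(1, length(x), by = 2)]    # same as getDat(x, 2)
x[seq(10, length(x), by = 3)]   # same as getDat(x, 3, 10)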

