How to join tables and generate sums of columns in R?

I have several tables (two in this example) with the same structure. I would like to join on ID_Position and ID_Name, and generate the sum of the January and February columns in the output table (there may be NAs in both columns).
ID_Position<-c(1,2,3,4,5,6,7,8,9,10)
Position<-c("A","B","C","D","E","H","I","J","X","W")
ID_Name<-c(11,12,13,14,15,16,17,18,19,20)
Name<-c("Michael","Tobi","Chris","Hans","Likas","Martin","Seba","Li","Sha","Susi")
jan<-c(10,20,30,22,23,2,22,24,26,28)
feb<-c(10,30,20,12,NA,3,NA,22,24,26)
df1 <- data.frame(ID_Position,Position,ID_Name,Name,jan,feb)
ID_Position<-c(1,2,3,4,5,6,7,8,9,10)
Position<-c("A","B","C","D","E","H","I","J","X","W")
ID_Name<-c(11,12,13,14,15,16,17,18,19,20)
Name<-c("Michael","Tobi","Chris","Hans","Likas","Martin","Seba","Li","Sha","Susi")
jan<-c(10,20,30,22,NA,NA,22,24,26,28)
feb<-c(10,30,20,12,23,3,3,22,24,26)
df2 <- data.frame(ID_Position,Position,ID_Name,Name,jan,feb)
I tried both the inner and the full join, but neither seems to work as I desire:
library(plyr)
test<-join(df1, df2, by =c("ID_Position","ID_Name") , type = "inner", match = "all")
Desired output:
ID_Position Position ID_Name    Name jan feb
          1        A      11 Michael  20  20
          2        B      12    Tobi  40  60
          3        C      13   Chris  60  40
          4        D      14    Hans  44  24
          5        E      15   Likas  23  23
          6        H      16  Martin   2   6
          7        I      17    Seba  44  22
          8        J      18      Li  48  44
          9        X      19     Sha  52  48
         10        W      20    Susi  56  52

Your desired output doesn't seem entirely correct (for instance, Seba's feb values are NA and 3, so with na.rm = TRUE the sum should be 3, not 22), but here's how you can do this efficiently using a data.table binary join, which lets you run functions while joining via the by = .EACHI option:
library(data.table)
setkey(setDT(df1), ID_Position, ID_Name, Name)
setkey(setDT(df2), ID_Position, ID_Name, Name)
df2[df1, .(jan = sum(jan, i.jan, na.rm = TRUE),
           feb = sum(feb, i.feb, na.rm = TRUE)),
    by = .EACHI]
#     ID_Position ID_Name    Name jan feb
#  1:           1      11 Michael  20  20
#  2:           2      12    Tobi  40  60
#  3:           3      13   Chris  60  40
#  4:           4      14    Hans  44  24
#  5:           5      15   Likas  23  23
#  6:           6      16  Martin   2   6
#  7:           7      17    Seba  44   3
#  8:           8      18      Li  48  44
#  9:           9      19     Sha  52  48
# 10:          10      20    Susi  56  52
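For comparison, the same totals can be computed without a keyed join by stacking both tables and summing per group. This dplyr sketch is not from the original answer, but it handles the NAs the same way via na.rm = TRUE:
library(dplyr)
bind_rows(df1, df2) %>%
  group_by(ID_Position, Position, ID_Name, Name) %>%
  summarise(jan = sum(jan, na.rm = TRUE),
            feb = sum(feb, na.rm = TRUE),
            .groups = "drop")
This also generalizes to more than two tables: just pass them all to bind_rows().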

Related

How to use cumsum to get the cumulative sum of a freq table in R

I have the following frequency table (reproducible code below). Each row shows the Monday of a given week and the number of observations in that week.
print(helper_table)
          Week  n
 1: 10/04/2021 42
 2: 10/11/2021 18
 3: 10/18/2021 40
 4: 10/25/2021 33
 5: 11/01/2021 29
 6: 11/08/2021 27
 7: 11/15/2021 43
 8: 11/22/2021 41
 9: 11/29/2021 17
10: 12/06/2021 27
11: 12/13/2021 27
12: 12/20/2021 26
13: 12/27/2021 13
14: 01/03/2022 10
15: 01/10/2022 15
16: 01/17/2022 13
17: 01/24/2022 15
18: 01/31/2022 20
19: 02/07/2022 30
20: 02/14/2022 14
21: 02/21/2022 20
22: 02/28/2022  7
But my goal is to have another column that shows the cumulative frequency. I'm trying the code below, but it just duplicates the n column. I thought the n column being an integer might be preventing it from working, but converting it to numeric didn't help.
helper_table$n <- as.numeric(helper_table$n)
helper_table %>%
  group_by(Week) %>%
  # arrange(desc(x1)) %>%
  mutate(cumsum = cumsum(n))
And this resulted in the following output:
# A tibble: 22 x 3
# Groups:   Week [22]
   Week           n cumsum
   <chr>      <dbl>  <dbl>
 1 10/04/2021    42     42
 2 10/11/2021    18     18
 3 10/18/2021    40     40
 4 10/25/2021    33     33
 5 11/01/2021    29     29
 6 11/08/2021    27     27
 7 11/15/2021    43     43
 8 11/22/2021    41     41
 9 11/29/2021    17     17
10 12/06/2021    27     27
# ... with 12 more rows
Data:
data.table::data.table(
  Week = c("10/04/2021","10/11/2021","10/18/2021",
           "10/25/2021","11/01/2021","11/08/2021","11/15/2021",
           "11/22/2021","11/29/2021","12/06/2021","12/13/2021","12/20/2021",
           "12/27/2021","01/03/2022","01/10/2022","01/17/2022",
           "01/24/2022","01/31/2022","02/07/2022","02/14/2022","02/21/2022",
           "02/28/2022"),
  n = c(42,18,40,33,29,27,43,41,17,27,27,
        26,13,10,15,13,15,20,30,14,20,7)
)
Perhaps changing:
mutate(cumsum = cumsum(n))
to:
mutate(cumsum = cumsum(1:n))
As pointed out by Gregor Thomas and Waldi in the comments, however, the real fix is to remove the grouping variable: every Week value is unique, so each group contains exactly one row, and a grouped cumsum(n) simply returns n. Dropping group_by() achieves the required result:
helper_table %>%
  mutate(cumsum = cumsum(n))
#          Week  n cumsum
# 1: 10/04/2021 42     42
# 2: 10/11/2021 18     60
# 3: 10/18/2021 40    100
# 4: 10/25/2021 33    133
# 5: 11/01/2021 29    162
# ...
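Since helper_table is already a data.table, the cumulative sum can also be added by reference without dplyr. A sketch (cum_n is just an illustrative column name); the order(...) step only matters if the rows might not already be chronological, since Week is stored as character:
library(data.table)
# compute the running total in date order, assigning back by reference
helper_table[order(as.IDate(Week, format = "%m/%d/%Y")), cum_n := cumsum(n)]
With the rows already sorted, helper_table[, cum_n := cumsum(n)] is enough.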

Select multiple ranges of columns using column names in data.table

Let's say I have a data table:
dt <- data.table(matrix(1:50, nrow = 5))
colnames(dt) <- letters[1:10]
> dt
   a  b  c  d  e  f  g  h  i  j
1: 1  6 11 16 21 26 31 36 41 46
2: 2  7 12 17 22 27 32 37 42 47
3: 3  8 13 18 23 28 33 38 43 48
4: 4  9 14 19 24 29 34 39 44 49
5: 5 10 15 20 25 30 35 40 45 50
I want to select several discontinuous ranges of columns, like a, c:d, f:h, and j. This can be done easily via dplyr's select():
dt %>% select(a, c:d, f:h, j)
I am looking for a data.table way of achieving the same.
Right now, I can either select columns individually in any order, dt[, .(a, c)], or give a single sequence of column names of the form startcol:endcol:
dt[, c:f]
However, I can't combine these two methods to select several column ranges in one shot in .SDcols, the way I did with dplyr::select.
We can use the range part in .SDcols and then append the other column by concatenating:
dt[, c(list(a = a), .SD), .SDcols = c:d]
If there are multiple ranges, we can build the sequences of positions with match and then get the corresponding column names:
i1 <- match(c("c", "f"), names(dt))
j1 <- match(c("d", "h"), names(dt))
nm1 <- c("a", names(dt)[unlist(Map(`:`, i1, j1))], "j")
dt[, ..nm1]
#   a  c  d  f  g  h  j
#1: 1 11 16 26 31 36 46
#2: 2 12 17 27 32 37 47
#3: 3 13 18 28 33 38 48
#4: 4 14 19 29 34 39 49
#5: 5 15 20 30 35 40 50
Also, the dplyr methods can be used within the data.table (with dplyr loaded):
dt[, select(.SD, a, c:d, f:h, j)]
#   a  c  d  f  g  h  j
#1: 1 11 16 26 31 36 46
#2: 2 12 17 27 32 37 47
#3: 3 13 18 28 33 38 48
#4: 4 14 19 29 34 39 49
#5: 5 15 20 30 35 40 50
Here is a workaround with cbind and two or more selections.
cbind(dt[, .(a)], dt[, c:d])
#    a  c  d
# 1: 1 11 16
# 2: 2 12 17
# 3: 3 13 18
# 4: 4 14 19
# 5: 5 15 20
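If this comes up repeatedly, the match/Map idea above can be wrapped in a small helper; select_ranges is a hypothetical name, not an existing data.table function:
select_ranges <- function(dt, starts, ends) {
  # translate each start/end pair of column names into positions
  i1 <- match(starts, names(dt))
  j1 <- match(ends, names(dt))
  # expand every range and pull out the corresponding names
  nm <- names(dt)[unlist(Map(`:`, i1, j1))]
  dt[, ..nm]
}
# single columns are written as one-column ranges (a:a, j:j)
select_ranges(dt, starts = c("a", "c", "f", "j"), ends = c("a", "d", "h", "j"))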

Subset data frame where values are greater than another data frame

Say I have a data frame with 3 columns of data (a,b,c) and 1 column of categories with multiple instances of each category (class).
set.seed(273)
a <- floor(runif(20,0,100))
b <- floor(runif(20,0,100))
c <- floor(runif(20,0,100))
class <- floor(runif(20,0,6))
df1 <- data.frame(a,b,c,class)
print(df1)
    a  b  c class
1  31 73 28     3
2  44 33 57     3
3  19 35 53     0
4  68 70 39     4
5  92  7 57     2
6  13 67 23     3
7  73 50 14     2
8  59 14 91     5
9  37  3 72     5
10 27  3 13     4
11 63 28  0     5
12 51  7 35     4
13 11 36 76     3
14 72 25  8     5
15 23 24  6     3
16 15  1 16     5
17 55 24  5     5
18  2 54 39     1
19 54 95 20     3
20 60 39 65     1
And I have another data frame with the same 3 columns of data and category column, however this only has one instance per category (class).
a <- floor(runif(6,0,20))
b <- floor(runif(6,0,20))
c <- floor(runif(6,0,20))
class <- seq(0,5)
df2 <- data.frame(a,b,c,class)
print(df2)
   a  b  c class
1  8 15 13     0
2  0  3  6     1
3 14  4  0     2
4  7 10  6     3
5 18 18 16     4
6 17 17 11     5
How do I subset the first data frame so that it keeps only rows where a, b, and c are all greater than the corresponding values in the second data frame for that class? For example, I only want rows where class == 0 if a > 8 & b > 15 & c > 13.
Note that I don't want to join the data frames, as the second data frame just holds the lowest acceptable values for the first data frame.
As commented by Frank, this can be done with a non-equi join.
# coerce to data.table; the non-equi join finds which rows of df1 fulfill the conditions in df2
tmp <- setDT(df1)[setDT(df2), on = .(class, a > a, b > b, c > c),
                  nomatch = 0L, which = TRUE]
# return the subset in the original order of df1
df1[sort(tmp)]
     a  b  c class
 1: 31 73 28     3
 2: 44 33 57     3
 3: 19 35 53     0
 4: 68 70 39     4
 5: 92  7 57     2
 6: 13 67 23     3
 7: 73 50 14     2
 8: 11 36 76     3
 9:  2 54 39     1
10: 54 95 20     3
11: 60 39 65     1
The parameter which = TRUE returns a vector of the matching row numbers instead of the joined data set, which saves us from creating a row id column before the join. (Credit to @Frank for reminding me of the which parameter!)
Note that there is no row in df1 which fulfills the condition for class == 5 in df2. Therefore, the parameter nomatch = 0L is used to exclude non-matching rows from the result.
This can be put together in a "one-liner":
setDT(df1)[sort(df1[setDT(df2), on = .(class, a > a, b > b, c > c), nomatch = 0L, which = TRUE])]
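Although the OP preferred not to join, for comparison the same subset can be produced in base R with an ordinary merge on class followed by row-wise filtering. This sketch is not from the original answer, and the .min suffix is just an illustrative name; note the result comes back ordered by class rather than in df1's original row order:
m <- merge(df1, df2, by = "class", suffixes = c("", ".min"))
subset(m, a > a.min & b > b.min & c > c.min, select = c(a, b, c, class))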

Selecting columns based on index positions, when the index positions are passed through variables in data.table R

I can select one column by index position in data.table by passing the index position through a variable like this:
DT <- data.table(a = 1:6, b=10:15, c=20:25, d=30:35, e = 40:45)
i <- 1
j <- 5
DT[, ..i]
But how can I select columns i : i+2 and j in one line of code using data.table syntax?
Your advice will be appreciated.
If you don't want to use lukeA's approach with the with = FALSE parameter, you have other choices as well:
DT[, .SD, .SDcols = c(i:(i+2), j)]
#   a  b  c  e
#1: 1 10 20 40
#2: 2 11 21 41
#3: 3 12 22 42
#4: 4 13 23 43
#5: 5 14 24 44
#6: 6 15 25 45
Note the parentheses around (i+2): the colon operator takes precedence over +, so i:i+2 would evaluate as (i:i) + 2.
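To see the precedence issue directly:
i <- 1
i:i + 2    # parsed as (i:i) + 2, so this returns 3
i:(i + 2)  # returns 1 2 3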
This one is a modification of OP's code and not exactly a one-liner as requested:
icol <- c(i:(i+2), j); DT[, ..icol]
   a  b  c  e
1: 1 10 20 40
2: 2 11 21 41
3: 3 12 22 42
4: 4 13 23 43
5: 5 14 24 44
6: 6 15 25 45
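For reference, lukeA's with = FALSE approach mentioned above would presumably look like this:
DT[, c(i:(i + 2), j), with = FALSE]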

Add data frames row wise with [d]plyr

I have two data frames
df1
#    a  b
# 1 10 20
# 2 11 21
# 3 12 22
# 4 13 23
# 5 14 24
# 6 15 25
df2
#   a b
# 1 4 8
I want the following output:
df3
#    a  b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
i.e. add df2 to each row of df1.
Is there a way to get the desired output using plyr (mdplyr??) or dplyr?
I see no reason for "dplyr" for something like this. In base R you could just do:
df1 + unclass(df2)
#    a  b
# 1 14 28
# 2 15 29
# 3 16 30
# 4 17 31
# 5 18 32
# 6 19 33
Which is the same as df1 + list(4, 8).
A one-liner with dplyr:
mutate_each(df1, funs(. + df2$.), a:b)
#   a  b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
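mutate_each() has since been deprecated; in current dplyr the same idea can be written with across() and cur_column(). A sketch, not part of the original answer:
library(dplyr)
df1 %>%
  mutate(across(a:b, ~ .x + df2[[cur_column()]]))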
A base R solution using the sweet function sweep:
sweep(df1, 2, unlist(df2), '+')
#   a  b
#1 14 28
#2 15 29
#3 16 30
#4 17 31
#5 18 32
#6 19 33
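Another compact base R option, not among the original answers, applies + column by column with Map; it relies on df2 having exactly one row, so each of its columns recycles against the corresponding column of df1:
as.data.frame(Map(`+`, df1, df2))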
