Creating a total row based on the values of another column - r

Let's consider the following example:
set.seed(5)
df <- data.frame(CATEGORY = rep(c("A", "B", "C", "D"), each = 2),
SUBCATEGORY = paste0(rep(c("A", "B", "C", "D"), each = 2), 1:2),
COUNT = sample(1:1000, size = 8, replace = TRUE),
SUBCOUNT = sample(1:200, size = 8, replace = TRUE),
stringsAsFactors = FALSE)
df$SUBCOUNT_PCT <- paste0(formatC(df$SUBCOUNT/df$COUNT * 100, digits = 2, format = 'f'), "%")
> df
CATEGORY SUBCATEGORY COUNT SUBCOUNT SUBCOUNT_PCT
1 A A1 201 192 95.52%
2 A A2 686 23 3.35%
3 B B1 917 55 6.00%
4 B B2 285 99 34.74%
5 C C1 105 64 60.95%
6 C C2 702 112 15.95%
7 D D1 528 53 10.04%
8 D D2 808 41 5.07%
I would like to create rows for CATEGORY which aggregate COUNT and SUBCOUNT as follows:
CATEGORY SUBCATEGORY COUNT SUBCOUNT SUBCOUNT_PCT
1 A TOTAL 887 215 24.24%
2 A A1 201 192 95.52%
3 A A2 686 23 3.35%
4 B TOTAL 1202 154 12.81%
5 B B1 917 55 6.00%
6 B B2 285 99 34.74%
7 C TOTAL 807 176 21.81%
8 C C1 105 64 60.95%
9 C C2 702 112 15.95%
10 D TOTAL 1336 94 7.04%
11 D D1 528 53 10.04%
12 D D2 808 41 5.07%
Is there a way to do this without having to loop through every CATEGORY?

Use dplyr to summarize the data by CATEGORY, then bind the summary rows back onto the original data and sort:
library(dplyr)
df %>%
group_by(CATEGORY) %>%
summarize(SUBCATEGORY = "TOTAL",
COUNT = sum(COUNT),
SUBCOUNT = sum(SUBCOUNT),
SUBCOUNT_PCT = sprintf("%.2f%%", SUBCOUNT / COUNT * 100)) %>%
bind_rows(., df) %>%
arrange(CATEGORY)
# A tibble: 12 x 5
CATEGORY SUBCATEGORY COUNT SUBCOUNT SUBCOUNT_PCT
<chr> <chr> <int> <int> <chr>
1 A TOTAL 887 215 24.24%
2 A A1 201 192 95.52%
3 A A2 686 23 3.35%
4 B TOTAL 1202 154 12.81%
5 B B1 917 55 6.00%
6 B B2 285 99 34.74%
7 C TOTAL 807 176 21.81%
8 C C1 105 64 60.95%
9 C C2 702 112 15.95%
10 D TOTAL 1336 94 7.04%
11 D D1 528 53 10.04%
12 D D2 808 41 5.07%
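For anyone who prefers to avoid a package dependency, the same summarize-then-bind idea can be sketched in base R. This is only a sketch: the data frame below is typed in from the printed output in the question rather than regenerated from the seed, and the names `totals` and `combined` are made up for illustration.

```r
# Example data copied from the question's printed output
df <- data.frame(
  CATEGORY    = rep(c("A", "B", "C", "D"), each = 2),
  SUBCATEGORY = paste0(rep(c("A", "B", "C", "D"), each = 2), 1:2),
  COUNT       = c(201, 686, 917, 285, 105, 702, 528, 808),
  SUBCOUNT    = c(192, 23, 55, 99, 64, 112, 53, 41),
  stringsAsFactors = FALSE
)
df$SUBCOUNT_PCT <- sprintf("%.2f%%", df$SUBCOUNT / df$COUNT * 100)

# One TOTAL row per CATEGORY: sum both count columns in a single call
totals <- aggregate(cbind(COUNT, SUBCOUNT) ~ CATEGORY, data = df, FUN = sum)
totals$SUBCATEGORY  <- "TOTAL"
totals$SUBCOUNT_PCT <- sprintf("%.2f%%", totals$SUBCOUNT / totals$COUNT * 100)

# Stack the totals and the detail rows, then sort so that each TOTAL row
# comes first within its CATEGORY (FALSE sorts before TRUE)
combined <- rbind(totals[names(df)], df)
combined <- combined[order(combined$CATEGORY,
                           combined$SUBCATEGORY != "TOTAL",
                           combined$SUBCATEGORY), ]
rownames(combined) <- NULL
combined
```

The percentage for each TOTAL row is recomputed from the summed counts; averaging the per-row percentages would give a different (and wrong) figure.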

Related

How to identify data which does not show link between two data sets? [duplicate]

Dataset1:
id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1
Dataset2:
id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f
I have two data sets, linked by id1 and id2. How can I identify the rows in each data set that fail to link to the other?
We can use anti_join from the dplyr package to filter the rows with no match.
library(dplyr)
Dataset1_anti <- Dataset1 %>% anti_join(Dataset2, by = c("id1", "id2"))
Dataset1_anti
# id1 id2 abc n
# 1 2 121 no 1
# 2 3 122 yes 2
Dataset2_anti <- Dataset2 %>% anti_join(Dataset1, by = c("id1", "id2"))
Dataset2_anti
# id1 id2 age gen
# 1 2 1 52 f
# 2 121 122 41 f
# 3 121 122 44 m
# 4 4 221 56 m
DATA
Dataset1 <- read.table(text = "id1 id2 abc n
1 111 yes 2
2 121 no 1
3 122 yes 2
4 224 no 2
5 441 no 3
6 665 yes 1 ",
header = TRUE, stringsAsFactors = FALSE)
Dataset2 <- read.table(text = "id1 id2 age gen
1 111 45 m
1 111 46 f
2 1 52 f
121 122 41 f
121 122 44 m
4 224 54 f
4 221 56 m
5 441 44 m
5 441 45 f
5 441 58 f
6 665 54 f ",
header = TRUE, stringsAsFactors = FALSE)
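If dplyr is not available, a rough base R equivalent is to build a composite key from the two linking columns and keep the rows whose key does not appear in the other data set. This is a sketch under the assumption that the key columns never contain the separator character; `key1` and `key2` are illustrative names.

```r
Dataset1 <- data.frame(
  id1 = c(1, 2, 3, 4, 5, 6),
  id2 = c(111, 121, 122, 224, 441, 665),
  abc = c("yes", "no", "yes", "no", "no", "yes"),
  n   = c(2, 1, 2, 2, 3, 1)
)
Dataset2 <- data.frame(
  id1 = c(1, 1, 2, 121, 121, 4, 4, 5, 5, 5, 6),
  id2 = c(111, 111, 1, 122, 122, 224, 221, 441, 441, 441, 665),
  age = c(45, 46, 52, 41, 44, 54, 56, 44, 45, 58, 54),
  gen = c("m", "f", "f", "f", "m", "f", "m", "m", "f", "f", "f")
)

# One composite key per row; the separator keeps (12, 1) distinct from (1, 21)
key1 <- paste(Dataset1$id1, Dataset1$id2, sep = "_")
key2 <- paste(Dataset2$id1, Dataset2$id2, sep = "_")

# Rows of each data set with no partner in the other
Dataset1_anti <- Dataset1[!key1 %in% key2, ]
Dataset2_anti <- Dataset2[!key2 %in% key1, ]
```

Both keys have to match for a row to count as linked, which is why the row with id1 = 4 and id2 = 221 is reported even though id1 = 4 exists in Dataset1.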

calculate difference between values in different row and different column

I have a dataframe like this:
ID s1 e1 s2 e2
A 50 150 80 180
A 160 350 280 470
A 355 700 800 1150
B 100 500 150 550
B 550 1500 800 1750
When the ID is identical I would like to calculate the difference between values in consecutive rows but different columns (for ID A: s1 in row2 minus e1 in row1; s1 in row3 minus e1 in row2; s2 in row2 minus e2 in row1; s2 in row3 minus e2 in row2) and add these values to a new column (diff1 and diff2).
The dataframe would then look like this:
ID s1 e1 s2 e2 diff1 diff2
A 50 150 80 180
A 160 350 280 470 10 100
A 355 700 800 1150 5 330
B 100 500 150 550
B 550 1500 800 1750 50 250
Is this possible?
Thank you in advance
WD
After grouping by 'ID', take the lead of 's1' (the next row's value), subtract 'e1' from it, and create 'diff1' as the lag of that result so it lines up with the later row. Similarly, 'diff2' can be created from the corresponding pair of 's2' and 'e2' columns.
library(dplyr)
df1 %>%
group_by(ID) %>%
mutate(diff1 = lag(lead(s1) - e1), diff2 = lag(lead(s2)- e2))
# A tibble: 5 x 7
# Groups: ID [2]
# ID s1 e1 s2 e2 diff1 diff2
# <chr> <int> <int> <int> <int> <int> <int>
#1 A 50 150 80 180 NA NA
#2 A 160 350 280 470 10 100
#3 A 355 700 800 1150 5 330
#4 B 100 500 150 550 NA NA
#5 B 550 1500 800 1750 50 250
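Within each group, lag(lead(s1) - e1) comes out the same as s1 minus the previous row's e1, which may be easier to read. Here is a base R sketch of that formulation; `prev_in_group` is a hypothetical helper written for this example, not a function from any package.

```r
df1 <- data.frame(
  ID = c("A", "A", "A", "B", "B"),
  s1 = c(50, 160, 355, 100, 550),
  e1 = c(150, 350, 700, 500, 1500),
  s2 = c(80, 280, 800, 150, 800),
  e2 = c(180, 470, 1150, 550, 1750)
)

# Previous value within each group; the first row of every group gets NA
prev_in_group <- function(x, g) {
  ave(x, g, FUN = function(v) c(NA, head(v, -1)))
}

# Start of the current interval minus end of the previous one, per ID
df1$diff1 <- df1$s1 - prev_in_group(df1$e1, df1$ID)
df1$diff2 <- df1$s2 - prev_in_group(df1$e2, df1$ID)
df1
```

Grouping matters here: without it, the first row of ID B would be compared against the last row of ID A instead of getting NA.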
If there are multiple 's', 'e' pairs, one option with data.table would be to melt it to 'long' format and then dcast to 'wide' after doing the necessary calculation
library(data.table)
dnew <- dcast(melt(setDT(df1, keep.rownames = TRUE),
measure = patterns("^s\\d+", "^e\\d+"), value.name = c("s", "e"))[,
diffs := shift(shift(s, type = "lead") - e), .(ID, variable)][],
rn + ID ~ paste0('diff', variable), value.var = 'diffs')
df1[, names(dnew)[3:4] := dnew[, 3:4, with = FALSE]][, rn := NULL][]
# ID s1 e1 s2 e2 diff1 diff2
#1: A 50 150 80 180 NA NA
#2: A 160 350 280 470 10 100
#3: A 355 700 800 1150 5 330
#4: B 100 500 150 550 NA NA
#5: B 550 1500 800 1750 50 250

Splitting columns of a dataframe to merge a repetitive variable

I normally find an answer in previous questions posted here, but I can't seem to find this one, so here is my maiden question:
I have a dataframe in which one column contains repeated values. I would like each of those values to appear only once in the first column, with the other columns split so that the result has more columns than the original dataframe.
Example:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
The original dataframe has 3 columns and 15 rows.
It should turn into a dataframe with 5 rows and 7 columns: 'test', 'time1', 'time2', 'time3', 'score1', 'score2', 'score3'.
Does anyone have an idea how this could be done?
I think using dcast with rowid from the data.table-package is well suited for this task:
library(data.table)
dcast(setDT(df), test ~ rowid(test), value.var = c('time','score'), sep = '')
The result:
test time1 time2 time3 score1 score2 score3
1: 1 52 3 29 21 131 45
2: 2 79 44 6 119 1 186
3: 3 67 95 39 18 459 121
4: 4 83 50 40 493 466 497
5: 5 46 14 4 465 9 24
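The same long-to-wide step can also be sketched in base R with reshape(), provided you first number the occurrences of each test value. The seed and the helper column name `idx` below are arbitrary choices for this example, so the numbers will not match the output shown above.

```r
set.seed(1)  # arbitrary seed, only so the example is reproducible
df <- data.frame(test  = rep(1:5, 3),
                 time  = sample(1:100, 15),
                 score = sample(1:500, 15))

# Occurrence number (1, 2, 3) of each row within its 'test' value;
# this plays the role of rowid(test) in the data.table answer
df$idx <- ave(seq_along(df$test), df$test, FUN = seq_along)

# Spread 'time' and 'score' into one column per occurrence
wide <- reshape(df, idvar = "test", timevar = "idx",
                direction = "wide", sep = "")
wide
```

The column order from reshape() differs from the dcast result (time and score alternate), so reorder the columns afterwards if the grouping by variable matters.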
Please try this:
df <- data.frame(test = c(rep(1:5,3)), time = sample(1:100,15), score = sample(1:500,15))
df$class <- c(rep('a', 5), rep('b', 5), rep('c', 5))
df <- split(x = df, f = df$class)
binded <- cbind(df[[1]], df[[2]], df[[3]])
binded <- binded[,-c(5,9)]
> binded
test time score class time.1 score.1 class.1 time.2 score.2 class.2
1 1 40 404 a 57 409 b 70 32 c
2 2 5 119 a 32 336 b 93 177 c
3 3 20 345 a 44 91 b 100 42 c
4 4 47 468 a 60 265 b 24 478 c
5 5 16 52 a 38 219 b 3 92 c
Let me know if it works for you!

Aggregate function to group,count and mean

I have a dataset with three variables: a, b and c.
a 45 345
a 45 345
a 34 234
a 35 456
b 45 123
b 65 345
b 34 456
c 23 455
c 54 567
c 34 345
c 87 567
c 67 345
I want to aggregate the data set by a and b and get both the count and the mean of c for each group. The desired output is below. Is there a function that does both together?
A B numobs c
a 34 1 234
a 35 1 456
a 45 2 345
b 34 1 456
b 45 1 123
b 65 1 345
c 23 1 455
c 34 1 345
c 54 1 567
c 67 1 345
c 87 1 567
numobs is the count and c is the mean value
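We can use dplyr. Note that summarise() collapses each group to a single row, which is what the expected output shows (mutate() would keep all 12 input rows):
library(dplyr)
df1 %>%
group_by(A, B) %>%
summarise(numobs = n(), C = mean(C))
Or with data.table, using keyby so that the result is sorted by A and B:
library(data.table)
setDT(df1)[, .(numobs = .N, C = mean(C)), keyby = .(A, B)]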
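For completeness, here is a base R sketch that gets the count and the mean in one aggregate() call. It assumes the three columns are named A, B and C (the question does not show a header row), and the data frame is typed in from the question.

```r
df1 <- data.frame(
  A = c("a", "a", "a", "a", "b", "b", "b", "c", "c", "c", "c", "c"),
  B = c(45, 45, 34, 35, 45, 65, 34, 23, 54, 34, 87, 67),
  C = c(345, 345, 234, 456, 123, 345, 456, 455, 567, 345, 567, 345)
)

# Return both statistics from one function; aggregate() stores them
# together in a single matrix column named C
res <- aggregate(C ~ A + B, data = df1,
                 FUN = function(x) c(numobs = length(x), mean = mean(x)))

# Flatten the matrix column into two ordinary columns and rename them
res <- do.call(data.frame, res)
names(res) <- c("A", "B", "numobs", "C")

# Sort by A, then B, to match the layout asked for in the question
res <- res[order(res$A, res$B), ]
rownames(res) <- NULL
res
```

Only the (a, 45) group has two observations, so it is the only row where numobs is 2.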

Use function like cumulative sum by group or by each list element in R

I have the following data:
col1 = c(rep("a",4),rep("b",8),rep("c",6), rep("d",2))
col2 = sample(-100:250, 20)
col3 = cumsum(col2)
data = data.table(col1, col2, col3)
and data.table:
col1 col2 col3
1: a 56 56
2: a 90 146
3: a 85 231
4: a 214 445
5: b -39 406
6: b 116 522
7: b 42 564
8: b 131 695
9: b 161 856
10: b 54 910
11: b 15 925
12: b 229 1154
13: c 166 1320
14: c 224 1544
15: c -53 1491
16: c 87 1578
17: c -100 1478
18: c -11 1467
19: d 28 1495
20: d 143 1638
As you see it's just grouped by col1. I'd like to make some calculation (like cumsum, count if, etc) based on groups in col1.
In the end I would like to have:
col1 colsum countif>0 countif<0
a 445 4 0
b 709 7 1
c 313 3 3
d 171 2 0
#commentators
Guys! Please ... I have two solutions. The first is very unsightly (no point in posting it here; it builds a list and loops over it, doing the calculation for each element). The second is this:
a1 = aggregate(col2 ~ col1, data = data, FUN = sum)
a2 = aggregate(col2 > 0 ~ col1, data = data, FUN = sum)
a3 = aggregate(col2 < 0 ~ col1, data = data, FUN = sum)
cbind(a1, countif_1 = a2[[2]], countif_2 = a3[[2]])
I'm looking just for something nice and cool.
data[, list(colsum = sum(col2),
`countif>0` = sum(col2 > 0),
`countif<0` = sum(col2 < 0)), by = col1]
## col1 colsum countif>0 countif<0
## 1: a 445 4 0
## 2: b 709 7 1
## 3: c 313 3 3
## 4: d 171 2 0
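The title also mentions a cumulative sum by group, which none of the summaries here produce directly. A minimal base R sketch using ave() could look like this; the vectors are typed in from the table in the question, and `grp_cumsum` is just an illustrative name.

```r
col1 <- rep(c("a", "b", "c", "d"), times = c(4, 8, 6, 2))
col2 <- c(56, 90, 85, 214,
          -39, 116, 42, 131, 161, 54, 15, 229,
          166, 224, -53, 87, -100, -11,
          28, 143)

# Running total that restarts at the first row of every col1 group
grp_cumsum <- ave(col2, col1, FUN = cumsum)

data.frame(col1, col2, grp_cumsum)
```

The last running total in each group equals that group's colsum in the summary table.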
You can use dplyr to achieve something similar
library(dplyr)
set.seed(1)
col1 <- c(rep("a", 4), rep("b", 8), rep("c", 6), rep("d",2))
col2 <- sample(-100:250, 20)
# tbl_df() is deprecated in current dplyr; as_tibble() is its replacement
data <- tbl_df(data.frame(col1, col2))
str(data)
## Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 20 obs. of 2 variables:
## $ col1: Factor w/ 4 levels "a","b","c","d": 1 1 1 1 2 2 2 2 2 2 ...
## $ col2: int -7 30 99 216 -31 210 225 127 115 -79 ...
data %>%
group_by(col1) %>%
summarise(colsum = sum(col2),
countifpos = sum(col2 > 0),
countifneg = sum(col2 < 0))
## Source: local data frame [4 x 4]
## col1 colsum countifpos countifneg
## 1 a 338 3 1
## 2 b 497 4 4
## 3 c 758 6 0
## 4 d 184 2 0
You can use tapply to get summaries by group. For instance, first define the metrics you want to calculate (here the sum, the count of negative values, and the count of positive values):
metrics = function(x) { c(sum(x), length(x[x<0]) , length(x[x>0]) )}
Then use the metrics function to calculate your metrics by group via tapply. With the data shown in the question, this gives:
tapply (data$col2, data$col1, metrics)
$a
[1] 445   0   4
$b
[1] 709   1   7
$c
[1] 313   3   3
$d
[1] 171   0   2
You can then convert this output into a data frame as requested
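One way to do that conversion, sketched with the values from the question: stack the per-group vectors with rbind and wrap the result in a data frame. The element names inside `metrics` and the object names `by_group` and `out` are choices made for this example.

```r
col1 <- rep(c("a", "b", "c", "d"), times = c(4, 8, 6, 2))
col2 <- c(56, 90, 85, 214,
          -39, 116, 42, 131, 161, 54, 15, 229,
          166, 224, -53, 87, -100, -11,
          28, 143)

# Named elements become the column names after rbind
metrics <- function(x) {
  c(colsum = sum(x), countif_pos = sum(x > 0), countif_neg = sum(x < 0))
}

# tapply returns one metrics vector per group
by_group <- tapply(col2, col1, metrics)

# Bind the vectors row-wise into a matrix, then build the data frame
out <- data.frame(col1 = names(by_group),
                  do.call(rbind, by_group),
                  row.names = NULL)
out
```

Using sum(x > 0) instead of length(x[x > 0]) gives the same count, since TRUE is counted as 1.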
