R: sum at different levels [duplicate]

This question already has answers here:
generating sums of data according to values of a variable
(3 answers)
Closed 9 years ago.
I have a dataset X as:
customer_id event_type tot_count
931 1 5
231 2 6
231 1 3
333 3 9
444 1 1
931 3 3
333 1 21
444 2 43
I need the sum of tot_count at the customer_id and event_type level.
In SQL this is a one-line query:
select customer_id, event_type, sum(tot_count) from X group by 1,2
I need the same operation in R.

You can use the aggregate function:
aggregate(tot_count ~ customer_id + event_type, X, sum)
customer_id event_type tot_count
1 231 1 3
2 333 1 21
3 444 1 1
4 931 1 5
5 231 2 6
6 444 2 43
7 333 3 9
8 931 3 3

For fun, here are a few more options:
Since you know SQL, sqldf
> sqldf("select customer_id, event_type, sum(tot_count) from mydf group by 1,2")
customer_id event_type sum(tot_count)
1 231 1 3
2 231 2 6
3 333 1 21
4 333 3 9
5 444 1 1
6 444 2 43
7 931 1 5
8 931 3 3
If you have a lot of data, data.table
> library(data.table)
> DT <- data.table(mydf, key = c("customer_id", "event_type"))
> DT[, sum(tot_count), by = key(DT)]
customer_id event_type V1
1: 231 1 3
2: 231 2 6
3: 333 1 21
4: 333 3 9
5: 444 1 1
6: 444 2 43
7: 931 1 5
8: 931 3 3
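For completeness, the modern dplyr idiom reads much like the SQL (a sketch, assuming dplyr >= 1.0 for the .groups argument):

```r
library(dplyr)

# the example data from the question
X <- data.frame(
  customer_id = c(931, 231, 231, 333, 444, 931, 333, 444),
  event_type  = c(1, 2, 1, 3, 1, 3, 1, 2),
  tot_count   = c(5, 6, 3, 9, 1, 3, 21, 43)
)

# group by the two keys and sum, dropping the grouping afterwards
X %>%
  group_by(customer_id, event_type) %>%
  summarise(tot_count = sum(tot_count), .groups = "drop")
```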

Related

How to create a conditionally increasing sequence within a group?

I have a dataframe like the following:
df <- data.frame("id" = c(111,111,111,111,222,222,222,222,222,333,333,333),
"Encounter" = c(1,2,3,4,1,2,3,4,5,1,2,3),
"Level" = c(1,1,2,3,3,4,1,2,3,3,4,4),
"Gap_Days" = c(NA,3,2,15,NA,1,18,3,2,NA,77,1))
df
id Encounter Level Gap_Days
1 111 1 1 NA
2 111 2 1 3
3 111 3 2 2
4 111 4 3 15
5 222 1 3 NA
6 222 2 4 1
7 222 3 1 18
8 222 4 2 3
9 222 5 3 2
10 333 1 3 NA
11 333 2 4 77
12 333 3 4 1
Where Level is a numeric signaling the type of encounter and Gap_Days is the number of days since the previous encounter, and is thus NA for the first encounter in each id group.
I'm looking to create a variable, "Session", that will start at 1 for the first Encounter within an id group, and increase sequentially when a Level fails to increase from the previous encounter, or when it takes more than 3 days between encounters. Basically it is considered a new "Session" each time these conditions aren't met for an Encounter. I'd like to do this within each group, ideally resulting in something like:
df2 <- data.frame("id" = c(111,111,111,111,222,222,222,222,222,333,333,333),
"Encounter" = c(1,2,3,4,1,2,3,4,5,1,2,3),
"Level" = c(1,1,2,3,3,4,1,2,3,3,4,4),
"Gap_Days" = c(NA,3,2,15,NA,1,18,3,2,NA,77,1),
"Session" = c(1,2,2,3,1,1,2,2,2,1,2,3))
df2
id Encounter Level Gap_Days Session
1 111 1 1 NA 1
2 111 2 1 3 2
3 111 3 2 2 2
4 111 4 3 15 3
5 222 1 3 NA 1
6 222 2 4 1 1
7 222 3 1 18 2
8 222 4 2 3 2
9 222 5 3 2 2
10 333 1 3 NA 1
11 333 2 4 77 2
12 333 3 4 1 3
In the actual data there are no strict limits to the number of Encounters or Sessions within each group. The first encounter can begin at any level, and it is not necessary that the level only increase by 1 i.e. if the level increased from 1 to 4 between encounters that could still be considered the same Session.
I'd prefer a dplyr solution, but am open to any ideas to help accomplish this!
You can do the following
library(dplyr)
df %>% group_by(id) %>% mutate(Session = cumsum(c(TRUE, diff(Level) <= 0) | Gap_Days > 3))
## A tibble: 12 x 5
## Groups: id [3]
# id Encounter Level Gap_Days Session
# <dbl> <dbl> <dbl> <dbl> <int>
# 1 111 1 1 NA 1
# 2 111 2 1 3 2
# 3 111 3 2 2 2
# 4 111 4 3 15 3
# 5 222 1 3 NA 1
# 6 222 2 4 1 1
# 7 222 3 1 18 2
# 8 222 4 2 3 2
# 9 222 5 3 2 2
#10 333 1 3 NA 1
#11 333 2 4 77 2
#12 333 3 4 1 3
You probably want to ungroup afterwards.
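For those avoiding dplyr, the same logic can be sketched in base R with ave(); note the diff(Level) <= 0 comparison, which also treats a decrease in Level as a new session (in the example data every decrease coincides with a gap over 3 days, so the output is identical):

```r
# example data from the question
df <- data.frame(
  id        = c(111,111,111,111,222,222,222,222,222,333,333,333),
  Encounter = c(1,2,3,4,1,2,3,4,5,1,2,3),
  Level     = c(1,1,2,3,3,4,1,2,3,3,4,4),
  Gap_Days  = c(NA,3,2,15,NA,1,18,3,2,NA,77,1)
)

# per id: a new session starts when the level fails to increase
# or when more than 3 days passed since the previous encounter
new_session <- function(i) {
  lvl <- df$Level[i]
  gap <- df$Gap_Days[i]
  cumsum(c(TRUE, diff(lvl) <= 0) | (!is.na(gap) & gap > 3))
}
df$Session <- ave(seq_len(nrow(df)), df$id, FUN = new_session)
df$Session
# 1 2 2 3 1 1 2 2 2 1 2 3
```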

R Loop To New Data Frame Summary Weighted

I have a tall data frame as such:
data = data.frame("id"=c(1,2,3,4,5,6,7,8,9,10),
"group"=c(1,1,2,1,2,2,2,2,1,2),
"type"=c(1,1,2,3,2,2,3,3,3,1),
"score1"=c(sample(1:4,10,r=T)),
"score2"=c(sample(1:4,10,r=T)),
"score3"=c(sample(1:4,10,r=T)),
"score4"=c(sample(1:4,10,r=T)),
"score5"=c(sample(1:4,10,r=T)),
"weight1"=c(173,109,136,189,186,146,173,102,178,174),
"weight2"=c(147,187,125,126,120,165,142,129,144,197),
"weight3"=c(103,192,102,159,128,179,195,193,135,145),
"weight4"=c(114,182,199,101,111,116,198,123,119,181),
"weight5"=c(159,125,104,171,166,154,197,124,180,154))
library(reshape2)
library(plyr)
data1 <- reshape(data, direction = "long",
varying = list(c(paste0("score",1:5)),c(paste0("weight",1:5))),
v.names = c("score","weight"),
idvar = "id", timevar = "count", times = c(1:5))
data1 <- data1[order(data1$id), ]
And what I want to create is a new data frame like so:
want = data.frame("score"=rep(1:4,6),
"group"=rep(1:2,12),
"type"=rep(1:3,8),
"weightedCOUNT"=NA) # how to calculate this? count(data1, score, wt = weight)
I am just not sure how to calculate weightedCOUNT, which should apply the weights to the score variable, giving in column 'weightedCOUNT' a weighted count aggregated by score and group and type.
An option would be to melt (from data.table, which can take multiple measure patterns), then group by 'group' and 'type' and get the count:
library(data.table)
library(dplyr)
melt(setDT(data), measure = patterns('^score', "^weight"),
value.name = c("score", "weight")) %>%
group_by(group, type) %>%
count(score, wt = weight)
If we need to have a complete set of combinations
library(tidyr)
melt(setDT(data), measure = patterns('^score', "^weight"),
value.name = c("score", "weight")) %>%
group_by(group, type) %>%
count(score, wt = weight) %>%
ungroup %>%
complete(group, type, score, fill = list(n = 0))
If I understand correctly, weightedCOUNT is the sum of weights grouped by score, group, and type.
For the sake of completeness, I would like to show how the accepted solution would look like when implemented in pure base R and pure data.table syntax, resp.
Base R
The OP was almost there. He has already reshaped data from wide to long format for multiple value variables. Only the final aggregation step was missing:
data1 <- reshape(data, direction = "long",
varying = list(c(paste0("score",1:5)),c(paste0("weight",1:5))),
v.names = c("score","weight"),
idvar = "id", timevar = "count", times = c(1:5))
result <- aggregate(weight ~ score + group + type, data1, FUN = sum)
result
score group type weight
1 1 1 1 479
2 3 1 1 558
3 4 1 1 454
4 1 2 1 378
5 2 2 1 154
6 3 2 1 174
7 4 2 1 145
8 1 2 2 535
9 2 2 2 855
10 3 2 2 248
11 4 2 2 499
12 1 1 3 189
13 2 1 3 351
14 3 1 3 600
15 4 1 3 362
16 1 2 3 596
17 2 2 3 265
18 3 2 3 193
19 4 2 3 522
result can be reordered by
with(result, result[order(score, group, type), ])
score group type weight
1 1 1 1 479
12 1 1 3 189
4 1 2 1 378
8 1 2 2 535
16 1 2 3 596
13 2 1 3 351
5 2 2 1 154
9 2 2 2 855
17 2 2 3 265
2 3 1 1 558
14 3 1 3 600
6 3 2 1 174
10 3 2 2 248
18 3 2 3 193
3 4 1 1 454
15 4 1 3 362
7 4 2 1 145
11 4 2 2 499
19 4 2 3 522
data.table
As shown by akrun, melt() from the data.table package can be combined with dplyr. Alternatively, we can stay with the data.table syntax for aggregation:
library(data.table)
cols <- c("score", "weight") # to save typing
melt(setDT(data), measure = patterns(cols), value.name = cols)[
, .(weightedCOUNT = sum(weight)), keyby = .(score, group, type)]
score group type weightedCOUNT
1: 1 1 1 479
2: 1 1 3 189
3: 1 2 1 378
4: 1 2 2 535
5: 1 2 3 596
6: 2 1 3 351
7: 2 2 1 154
8: 2 2 2 855
9: 2 2 3 265
10: 3 1 1 558
11: 3 1 3 600
12: 3 2 1 174
13: 3 2 2 248
14: 3 2 3 193
15: 4 1 1 454
16: 4 1 3 362
17: 4 2 1 145
18: 4 2 2 499
19: 4 2 3 522
The keyby parameter is used for grouping and ordering the output in one step.
Completion of missing combinations of the grouping variables is also possible in data.table syntax using the cross join function CJ():
melt(setDT(data), measure = patterns(cols), value.name = cols)[
, .(weightedCOUNT = sum(weight)), keyby = .(score, group, type)][
CJ(score, group, type, unique = TRUE), on = .(score, group, type)][
is.na(weightedCOUNT), weightedCOUNT := 0][]
score group type weightedCOUNT
1: 1 1 1 479
2: 1 1 2 0
3: 1 1 3 189
4: 1 2 1 378
5: 1 2 2 535
6: 1 2 3 596
7: 2 1 1 0
8: 2 1 2 0
9: 2 1 3 351
10: 2 2 1 154
11: 2 2 2 855
12: 2 2 3 265
13: 3 1 1 558
14: 3 1 2 0
15: 3 1 3 600
16: 3 2 1 174
17: 3 2 2 248
18: 3 2 3 193
19: 4 1 1 454
20: 4 1 2 0
21: 4 1 3 362
22: 4 2 1 145
23: 4 2 2 499
24: 4 2 3 522
score group type weightedCOUNT
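As yet another angle, the reshape and the weighted count can both stay in the tidyverse (a sketch, assuming tidyr >= 1.0 for pivot_longer() with a ".value" spec); the scores below are random, so only the structure of the result is fixed:

```r
library(dplyr)
library(tidyr)

set.seed(1)
scores  <- replicate(5, sample(1:4, 10, replace = TRUE))
weights <- matrix(sample(100:200, 50, replace = TRUE), nrow = 10)
colnames(scores)  <- paste0("score", 1:5)
colnames(weights) <- paste0("weight", 1:5)
data <- data.frame(id = 1:10,
                   group = c(1,1,2,1,2,2,2,2,1,2),
                   type  = c(1,1,2,3,2,2,3,3,3,1),
                   scores, weights)

data %>%
  # split "score3"/"weight3" names into a value column and an index
  pivot_longer(-c(id, group, type),
               names_to = c(".value", "rep"),
               names_pattern = "(score|weight)(\\d)") %>%
  # weighted count: sum of weight per score/group/type combination
  count(score, group, type, wt = weight, name = "weightedCOUNT")
```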

How to rank a column with a condition

I have a data frame :
dt <- read.table(text = "
1 390
1 366
1 276
1 112
2 97
2 198
2 400
2 402
3 110
3 625
4 137
4 49
4 9
4 578 ")
The first column is Index and the second is distance.
I want to add a column ranking the distances within each Index in descending order (the highest distance ranked first).
The result will be :
dt <- read.table(text = "
1 390 1
1 366 2
1 276 3
1 112 4
2 97 4
2 198 3
2 400 2
2 402 1
3 110 2
3 625 1
4 137 2
4 49 3
4 9 4
4 578 1")
Another R base approach
> dt$Rank <- unlist(tapply(-dt$V2, dt$V1, rank))
A tidyverse solution
dt %>%
group_by(V1) %>%
mutate(Rank=rank(-V2))
Or, as another base option, transform with ave:
transform(dt, s = ave(-V2, V1, FUN = rank))
V1 V2 s
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 97 4
6 2 198 3
7 2 400 2
8 2 402 1
9 3 110 2
10 3 625 1
11 4 137 2
12 4 49 3
13 4 9 4
14 4 578 1
You could group, arrange, and rownumber. The result is a bit easier on the eyes than a simple rank, I think, and so worth an extra step.
dt %>%
group_by(V1) %>%
arrange(V1,desc(V2)) %>%
mutate(rank = row_number())
# A tibble: 14 x 3
# Groups: V1 [4]
V1 V2 rank
<int> <int> <int>
1 1 390 1
2 1 366 2
3 1 276 3
4 1 112 4
5 2 402 1
6 2 400 2
7 2 198 3
8 2 97 4
9 3 625 1
10 3 110 2
11 4 578 1
12 4 137 2
13 4 49 3
14 4 9 4
If you don't need the rows rearranged, min_rank gives the same ranking while keeping the original row order:
dt %>%
group_by(V1) %>%
mutate(rank = min_rank(desc(V2)))
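The same ranking also has a one-liner in data.table (a sketch; frank() with ties.method = "min" matches min_rank() above):

```r
library(data.table)

# the question's data as a two-column table
dt <- data.table(V1 = c(1,1,1,1,2,2,2,2,3,3,4,4,4,4),
                 V2 = c(390,366,276,112,97,198,400,402,110,625,137,49,9,578))

# rank the negated distance so the largest distance gets rank 1, per V1
dt[, Rank := frank(-V2, ties.method = "min"), by = V1]
```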

How to keep initial row order

I have run this SQL statement through the sqldf package:
SELECT A,B, COUNT(*) AS NUM
FROM DF
GROUP BY A,B
I have got the output I wanted, but I would like to keep the initial row order. Unfortunately, the output has a different order.
For example:
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
5 51 5 332 2
6 51 5 332 1
7 51 5 332 1
> sqldf("SELECT A,B,C,D, COUNT (*) AS NUM
+ FROM DF
+ GROUP BY A,B,C,D")
A B C D NUM
1 11 2 432 4 1
2 11 3 432 4 1
3 13 4 241 5 1
4 42 5 2 3 1
5 51 5 332 1 2
6 51 5 332 2 1
As you can see, the row order changes (rows 5 and 6). It would be great if someone could help me with this issue.
If we need to use sqldf for this, add a row-number column rn and use ORDER BY with the names pasted together:
library(sqldf)
nm <- toString(names(DF))
DF1 <- cbind(rn = seq_len(nrow(DF)), DF)
nm1 <- toString(names(DF1))
fn$sqldf("SELECT $nm, COUNT (*) AS NUM
FROM DF1
GROUP BY $nm ORDER BY $nm1")
# A B C D NUM
#1 11 2 432 4 1
#2 11 3 432 4 1
#3 13 4 241 5 1
#4 42 5 2 3 1
#5 51 5 332 2 1
#6 51 5 332 1 2
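A dplyr sketch of the same idea avoids SQL entirely: mutate() preserves row order, and distinct() keeps the first occurrence of each duplicated row:

```r
library(dplyr)

# the example data from the question
DF <- data.frame(A = c(11,11,13,42,51,51,51),
                 B = c(2,3,4,5,5,5,5),
                 C = c(432,432,241,2,332,332,332),
                 D = c(4,4,5,3,2,1,1))

DF %>%
  group_by(A, B, C, D) %>%
  mutate(NUM = n()) %>%   # per-group count, rows stay in place
  ungroup() %>%
  distinct()              # drop repeats, keeping first appearance
```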

How to do a generic order [duplicate]

This question already has an answer here:
How to sort a matrix/data.frame by all columns
(1 answer)
Closed 5 years ago.
I have a database as a data frame and I would like to order it by all columns, keeping each row's elements together.
For example, if I do the following:
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
5 51 5 332 2
6 51 5 332 1
7 51 5 332 1
> DF=DF[order(A,B,C,D),]
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
6 51 5 332 1
7 51 5 332 1
5 51 5 332 2
Ok, this is what I wanted (pay attention to the last two rows), but I would like a generic solution, independent of the number of columns. I have tried the following, but it does not work.
> DF=DF[order(colnames(DF)),]
> DF
A B C D
1 11 2 432 4
2 11 3 432 4
3 13 4 241 5
4 42 5 2 3
I would be grateful if someone could help me with this little issue. Regards.
We can use do.call with order for ordering on all the columns of a dataset
DF[do.call(order, DF),]
If we use the tidyverse, there is arrange_at, which will take column names
library(dplyr)
DF %>%
arrange_at(vars(names(.)))
#or as #Sotos commented
#arrange_all()
#or
#arrange(!!! rlang::syms(names(.)))
# A B C D
#1 11 2 432 4
#2 11 3 432 4
#3 13 4 241 5
#4 42 5 2 3
#5 51 5 332 1
#6 51 5 332 1
#7 51 5 332 2
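In data.table syntax the same generic sort is an in-place setorderv() (a sketch; it takes a character vector of column names, so it works for any number of columns):

```r
library(data.table)

# the example data from the question
DF <- data.frame(A = c(11,11,13,42,51,51,51),
                 B = c(2,3,4,5,5,5,5),
                 C = c(432,432,241,2,332,332,332),
                 D = c(4,4,5,3,2,1,1))

setDT(DF)                 # convert to data.table by reference
setorderv(DF, names(DF))  # sort by every column, in place
```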
