Grouping over 2 columns and using values of subsequent groups in calculations - r

Suppose I have a df with 3 columns, group1, group2 & variable
set.seed(1)
group1 = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3))
group2 = c("A","B","C","D","B","C","C","B","C","A","B","D")
variable = c(as.integer(rnorm(12,2)**3))
df=data.frame(group1, group2, variable)
I added the column 'min1', which states whether the group2 value of a row is also present in the previous group1 (group1 - 1). The column 'plus1' does the same for the next group1 (group1 + 1). Below is the full data frame:
group1 group2 variable min1 plus1
1 1 A 3 0 0
2 1 B 11 0 1
3 1 C 2 0 1
4 2 D 47 0 0
5 2 B 13 1 1
6 2 C 2 1 1
7 3 C 16 1 0
8 3 B 21 1 1
9 3 C 18 1 0
10 4 A 5 0 0
11 4 B 44 1 0
12 4 D 14 0 0
Now I want to run calculations such as max() and sum() (and some more exotic ones) on the variable, not only over the values within each group1 & group2 combination, but also including the values for the same group2 in the group1 before (or after) it. The min1 example is shown below.
group1_min1 group2_min1 sum_min1 max_min1
1 2 B 24 13
2 2 C 4 2
3 3 C 36 18
4 3 B 34 21
5 4 B 65 44
Note that for group1_min1(3),group2_min1(C) three values are used: rows 6,7&9 (2,16&18).
I tried using group_by and summarize within dplyr, something like:
df %>%
group_by(group1, group2) %>%
summarize_each(funs(sum, max))
EDIT:
I found a solution to add the sum to the original df:
sum_min1 = c()
j=0
for (j in 1:(length(df$group1))){
if (df[j,"min1"] == 0){sum_min1 = c(sum_min1,0)} else {
sum_min1 = c(sum_min1,(sum(df[which((df[,"group1"] == df[j,"group1"] | df[,"group1"] == (df[j,"group1"]-1)) & df[,"group2"]==(df[j,"group2"])),"variable"])))
}
}
df = cbind(df,sum_min1)
This delivers the output:
group1 group2 variable min1 plus1 sum_min1
1 1 A 3 0 0 0
2 1 B 11 0 1 0
3 1 C 2 0 1 0
4 2 D 47 0 0 0
5 2 B 13 1 1 24
6 2 C 2 1 1 4
7 3 C 16 1 0 36
8 3 B 21 1 1 34
9 3 C 18 1 0 36
10 4 A 5 0 0 0
11 4 B 44 1 0 65
12 4 D 14 0 0 0
However, this seems a very crude approach and may be slow on big data sets. In reality there are multiple variables and multiple functions. It might also be a problem because I want to apply some user-defined functions that include a for loop over all the values.
Is there a more elegant way to do this?
Sorry for anything I do wrong, I am new to R and StackOverflow and not a native speaker.

# Data
set.seed(1)
group1 = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3))
group2 = c("A","B","C","D","B","C","C","B","C","A","B","D")
variable = c(as.integer(rnorm(12,2)**3))
df=data.frame(group1, group2, variable)
For the first part-
df$min1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "group1"] == 1){0} else {
max(df[x, "group2"] %in% df[df$group1 == df[x,"group1"] - 1,"group2"])}
})
df$plus1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "group1"] == max(df$group1)){0} else {
max(df[x, "group2"] %in% df[df$group1 == df[x,"group1"] + 1,"group2"])}
})
Second part
df$sum_min1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "min1"] == 0){0}else{
sum(df[df$group1 == df[x,"group1"] &
df$group2 == df[x,"group2"],"variable"],
df[df$group1 == df[x,"group1"] - 1 &
df$group2 == df[x,"group2"],"variable"])}
})
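The same idea can be generalised so that any summary function, not only sum, is applied to the values of the current and the previous group1 for a matching group2. What follows is only a minimal base R sketch: with_prev is a hypothetical helper name, the data are typed in from the table in the question, and group1 is assumed to be numeric with steps of 1.

```r
# Example data, typed in from the table shown in the question
df <- data.frame(
  group1   = rep(1:4, each = 3),
  group2   = c("A","B","C","D","B","C","C","B","C","A","B","D"),
  variable = c(3, 11, 2, 47, 13, 2, 16, 21, 18, 5, 44, 14),
  stringsAsFactors = FALSE
)

# with_prev() is a hypothetical helper: for every row it applies FUN to the
# values that share the row's key and sit in the same or the previous group.
# Rows whose key does not occur in the previous group get `fill` instead.
with_prev <- function(group, key, value, FUN, fill = 0) {
  vapply(seq_along(group), function(i) {
    in_prev <- group == group[i] - 1 & key == key[i]
    if (!any(in_prev)) return(fill)
    in_curr <- group == group[i] & key == key[i]
    FUN(value[in_prev | in_curr])
  }, numeric(1))
}

df$sum_min1 <- with_prev(df$group1, df$group2, df$variable, sum)
df$max_min1 <- with_prev(df$group1, df$group2, df$variable, max)
```

Any function that returns a single number, including a user-defined one, can be passed as FUN. Each row still scans the whole data frame, so this is no faster than the sapply version on large data.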

Related

Assign sequential group ID given a group start indicator

I need to assign subgroup IDs given a group ID and an indicator showing the beginning of the new subgroup. Here's a test dataset:
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group=group, indic=c(x1, x2))
Here is the resulting data frame:
df
group indic
1 A 0
2 A 0
3 A 0
4 A 1
5 A 1
6 A 1
7 A 0
8 A 0
9 B 0
10 B 1
11 B 0
12 B 1
13 B 0
14 B 1
15 B 0
16 B 1
indic==1 means that row is the beginning of a new subgroup, and the subgroup should be numbered 1 higher than the previous subgroup. Where indic==0 the subgroup should be the same as the previous subgroup. The subgroup numbering starts at 1. When the group variable changes, the subgroup numbering resets to 1. I would like to use the tidyverse framework.
Here is the result that I want:
df
group indic subgroup
1 A 0 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 1 3
6 A 1 4
7 A 0 4
8 A 0 4
9 B 0 1
10 B 1 2
11 B 0 2
12 B 1 3
13 B 0 3
14 B 1 4
15 B 0 4
16 B 1 5
I would like to be able to give some methods that I've tried already but didn't work, but I haven't been able to find anything even close. Any help will be appreciated.
You can just use
library(dplyr)
df %>% group_by(group) %>%
mutate(subgroup=cumsum(indic)+1)
# group indic subgroup
# <fct> <dbl> <dbl>
# 1 A 0 1
# 2 A 0 1
# 3 A 0 1
# 4 A 1 2
# 5 A 1 3
# 6 A 1 4
# 7 A 0 4
# 8 A 0 4
# 9 B 0 1
# 10 B 1 2
# 11 B 0 2
# 12 B 1 3
# 13 B 0 3
# 14 B 1 4
# 15 B 0 4
# 16 B 1 5
We use dplyr to do the grouping and then use cumsum, which takes the cumulative sum of the indic column within each group, so the subgroup number increases by one each time it sees a 1.
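For comparison, the same grouped cumulative sum can be written without dplyr. This is a minimal base R sketch that rebuilds the test data from the question and uses ave() to run cumsum separately within each group.

```r
# Test data from the question
df <- data.frame(
  group = rep(c("A", "B"), each = 8),
  indic = c(0, 0, 0, 1, 1, 1, 0, 0, rep(0:1, 4))
)

# ave() applies FUN to indic within each level of group and returns a
# vector aligned with the original rows
df$subgroup <- ave(df$indic, df$group, FUN = cumsum) + 1
```

Note that with either version, a group whose first row has indic == 1 starts its numbering at 2, because the cumulative sum is already 1 on that row.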

Creating an ifelse statement that is conditional by factor level

I'm new to R, so I'm sorry if this is obvious. I've been stuck on this for a while and have not been able to find an answer so far.
Data frame:
1 b c id e
2 0 1 45 5
3 1 0 45 7
4 0 1 48 5
5 1 0 46 7
Desired result:
1 b c id e f
2 0 1 45 5 1
3 1 0 45 7 1
4 0 1 48 5 0
5 1 0 46 7 0
What I'm trying to do: I am trying to create column F based on levels of b and c for people with the same ID. Column E is still important to me along with other omitted values, so I can't collapse the data on ID.
The closest I've gotten:
library(dplyr)
df2 <- df %>%
group_by(id) %>%
mutate(ifelse(b == 1 & c == 1, 1, 0))
But, I think my problem there is that I'm not using dplyr::group_by correctly so I'm essentially doing a base ifelse statement.
We don't need an ifelse here
df %>%
group_by(id) %>%
mutate(f = as.integer(any(b == 1) & any(c == 1)))
# A tibble: 4 x 5
# Groups: id [3]
# b c id e f
# <int> <int> <int> <int> <int>
#1 0 1 45 5 1
#2 1 0 45 7 1
#3 0 1 48 5 0
#4 1 0 46 7 0
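The reason no ifelse is needed is that any() collapses each group to a single TRUE or FALSE, which is then recycled to every row of that group. A minimal base R sketch of the same logic, assuming the data frame from the question is typed in as below:

```r
# Data frame from the question
df <- data.frame(
  b  = c(0L, 1L, 0L, 1L),
  c  = c(1L, 0L, 1L, 0L),
  id = c(45L, 45L, 48L, 46L),
  e  = c(5L, 7L, 5L, 7L)
)

# One logical per id, recycled to all rows that share that id
has_b <- ave(df$b == 1, df$id, FUN = any)
has_c <- ave(df$c == 1, df$id, FUN = any)

# f is 1 only when the id has at least one b == 1 and at least one c == 1
df$f <- as.integer(has_b & has_c)
```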

Perform multiple calculations based on multiple conditions on multiple columns in R

I am working on a data frame with timeslots, IDs and multiple variables.
set.seed(1)
timeslot = c(rep(1,3),rep(2,3),rep(3,3),rep(4,3))
ID = c("A","B","C","D","B","C","C","B","C","A","B","D")
variable1 = c(as.integer(rnorm(12,2)**3)-1)
variable2 = c(as.integer(rnorm(12,4)**2)+1)
df = data.frame(timeslot,ID,variable1,variable2)
The dataframe:
timeslot ID variable1 variable2
1 1 A 1 12
2 1 B 9 4
3 1 C 0 27
4 2 D 45 16
5 2 B 11 16
6 2 C 0 25
7 3 C 14 24
8 3 B 19 22
9 3 C 16 25
10 4 A 3 23
11 4 B 42 17
12 4 D 12 5
I want to perform some calculations on specific rows. For each row, I check whether that row's ID (A, B, C or D) is present not only in its own timeslot but also in the previous timeslot. Below is the column added via the following code (thanks to code_is_entropy):
df$min1 <- sapply(seq(nrow(df)), function(x)
{
if(df[x, "timeslot"] == 1){0} else {
max(df[x, "ID"] %in% df[df$timeslot == df[x,"timeslot"] - 1,"ID"])}
})
timeslot ID variable1 variable2 min1
1 1 A 1 12 0
2 1 B 9 4 0
3 1 C 0 27 0
4 2 D 45 16 0
5 2 B 11 16 1
6 2 C 0 25 1
7 3 C 14 24 1
8 3 B 19 22 1
9 3 C 16 25 1
10 4 A 3 23 0
11 4 B 42 17 1
12 4 D 12 5 0
For these rows calculations should be made. The other rows can have value 0. I want to include several different functions (also 'home-made'). These functions should run over all variables (in reality more than 2).
I succeeded in doing so for the functions sum, max, count and count_not_zero by using this code:
sumprev = c()
maxprev = c()
lengthprev = c()
lengthnotzeroprev = c()
for (j in seq_len(nrow(df))){
if (df[j,"min1"] == 0){sumprev = c(sumprev,0); maxprev = c(maxprev,0); lengthprev = c(lengthprev,0); lengthnotzeroprev = c(lengthnotzeroprev,0)} else {
# variable1 values for this ID in the current or the previous timeslot
vals = df[which((df[,"timeslot"] == df[j,"timeslot"] | df[,"timeslot"] == (df[j,"timeslot"]-1)) & df[,"ID"]==(df[j,"ID"])),"variable1"]
sumprev = c(sumprev,sum(vals))
maxprev = c(maxprev,max(vals))
lengthprev = c(lengthprev,length(vals))
# count the non-zero values explicitly instead of indexing vals by itself
lengthnotzeroprev = c(lengthnotzeroprev,sum(vals != 0))
}
}
df = cbind(df,sumprev,maxprev, lengthprev,lengthnotzeroprev)
timeslot ID variable1 variable2 min1 sumprev maxprev lengthprev lengthnotzeroprev
1 1 A 1 12 0 0 0 0 0
2 1 B 9 4 0 0 0 0 0
3 1 C 0 27 0 0 0 0 0
4 2 D 45 16 0 0 0 0 0
5 2 B 11 16 1 20 11 2 2
6 2 C 0 25 1 0 0 2 0
7 3 C 14 24 1 30 16 3 2
8 3 B 19 22 1 30 19 2 2
9 3 C 16 25 1 30 16 3 2
10 4 A 3 23 0 0 0 0 0
11 4 B 42 17 1 61 42 2 2
12 4 D 12 5 0 0 0 0 0
This is rather crude, and I can't figure out how to do this easily for multiple functions and for more than one variable.
I tried the data.table package but could not work out how to get it done.
Sorry for anything I do wrong, I am new to R and StackOverflow and not a native speaker.
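One possible way to extend this to several functions and several variables is to keep the functions in a named list and loop over both. The sketch below is only an illustration under assumptions: the data are typed in from the table above, and the names funs, vars and the generated column names (for example variable1_sum_prev) are made up for this example.

```r
# Data typed in from the table in the question
df <- data.frame(
  timeslot  = rep(1:4, each = 3),
  ID        = c("A","B","C","D","B","C","C","B","C","A","B","D"),
  variable1 = c(1, 9, 0, 45, 11, 0, 14, 19, 16, 3, 42, 12),
  variable2 = c(12, 4, 27, 16, 16, 25, 24, 22, 25, 23, 17, 5),
  stringsAsFactors = FALSE
)

# Named list of summary functions; user-defined ones can be added here
funs <- list(
  sum           = sum,
  max           = max,
  length        = length,
  lengthnotzero = function(v) sum(v != 0)
)
vars <- c("variable1", "variable2")

for (v in vars) {
  for (f in names(funs)) {
    new_col <- paste(v, f, "prev", sep = "_")
    df[[new_col]] <- vapply(seq_len(nrow(df)), function(i) {
      same_id <- df$ID == df$ID[i]
      in_prev <- same_id & df$timeslot == df$timeslot[i] - 1
      # ID not present in the previous timeslot: result is 0
      if (!any(in_prev)) return(0)
      in_curr <- same_id & df$timeslot == df$timeslot[i]
      funs[[f]](df[[v]][in_prev | in_curr])
    }, numeric(1))
  }
}
```

This adds one column per variable and function. It still scans all rows for every row, so on large data a keyed join (for example with data.table) would probably be worth exploring instead.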

Cumulative Sum Starting at Center of Data Frame - R

I have this data.frame called dum
dummy <- data.frame(label = "a", x = c(1,1,1,1,0,1,1,1,1,1,1,1,1))
dummy1 <- data.frame(label = "b", x = c(1,1,1,1,1,1,1,1,0,1,1,1,1))
dum <- rbind(dummy,dummy1)
What I am trying to do is take the cumulative sum starting at 0 in the x column of dum. The summing would be grouped by the label column, which can be implemented in dplyr or plyr. The part that I am struggling with is how to start the cumulative sum from the 0 position in x and go outward.
The resulting data.frame should look like this :
>dum
label x output
1 a 1 4
2 a 1 3
3 a 1 2
4 a 1 1
5 a 0 0
6 a 1 1
7 a 1 2
8 a 1 3
9 a 1 4
10 a 1 5
11 a 1 6
12 a 1 7
13 a 1 8
14 b 1 8
15 b 1 7
16 b 1 6
17 b 1 5
18 b 1 4
19 b 1 3
20 b 1 2
21 b 1 1
22 b 0 0
23 b 1 1
24 b 1 2
25 b 1 3
26 b 1 4
This would need to be iterated thousands of times over millions of rows of data.
As usual, thanks for any and all help
It seems more like you just want to find the distance to a zero, rather than any sort of cumulative sum. If that's the case, then
#find zeros for each group
zeros <- tapply(seq.int(nrow(dum)) * as.numeric(dum$x==0), dum$label, max)
#calculate distance from zero for each point
dist <- abs(zeros[dum$label]-seq.int(nrow(dum)))
And that gives
cbind(dum, dist)
# label x dist
# 1 a 1 4
# 2 a 1 3
# 3 a 1 2
# 4 a 1 1
# 5 a 0 0
# 6 a 1 1
# 7 a 1 2
# 8 a 1 3
# 9 a 1 4
# 10 a 1 5
# 11 a 1 6
# 12 a 1 7
# 13 a 1 8
# 14 b 1 8
# 15 b 1 7
# 16 b 1 6
# 17 b 1 5
# 18 b 1 4
# 19 b 1 3
# 20 b 1 2
# 21 b 1 1
# 22 b 0 0
# 23 b 1 1
# 24 b 1 2
# 25 b 1 3
# 26 b 1 4
Or even ave will let you do it in one step
dist <- with(dum, ave(x,label,FUN=function(x) abs(seq_along(x)-which.min(x))))
cbind(dum, dist)
You can do this with by but also with plyr, data.table, etc. The function that is used on each subset is
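f <- function(d) {
x <- d$x
# position of the first zero in this group
i <- match(0, x)
# cumulative sum running backwards from the zero to the start of the group
v1 <- rev(cumsum(rev(x[1:i])))
# cumulative sum running forwards after the zero (empty if the zero is the last element)
v2 <- cumsum(x[-seq_len(i)])
transform(d, output = c(v1, v2))
}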
To call it on each subset e.g. with by
res <- by(dum, list(dum$label), f)
do.call(rbind, res)
If you want to use ddply
library(plyr)
ddply(dum, .(label), f)
May be faster with data.table
library(data.table)
dumdt <- as.data.table(dum)
setkey(dumdt, label)
dumdt[, f(.SD), by = key(dumdt)]
Using dplyr
library(dplyr)
dum%>%
group_by(label)%>%
mutate(dist=abs(row_number()-which.min(x)))

Calculate ranks for each group

I have a df with types and values. I want to rank the rows by x within each type and, for each row, count how many other rows of the same type have a lower value of x (column pos).
e.g.
df <- data.frame(type = c("a","a","a","b","b","b"),x=c(1,77,1,34,1,8))
# for type a, row 2 (x = 77) has a higher x than rows 1 and 3, so it has a pos value of 2
I can do this with:
library(plyr)
df <- data.frame(type = c("a","a","a","b","b","b"),x=c(1,77,1,34,1,8))
df <- ddply(df,.(type), function(x) x[with(x, order(x)) ,])
df <- ddply(df,.(type), transform, pos = (seq_along(x)-1) )
type x pos
1 a 1 0
2 a 1 1
3 a 77 2
4 b 1 0
5 b 8 1
6 b 34 2
But this approach does not take into account the tie between rows 1 and 2 of type a. What's the easiest way to get the output where ties have the same value, e.g.
type x pos
1 a 1 0
2 a 1 0
3 a 77 2
4 b 1 0
5 b 8 1
6 b 34 2
ddply(df,.(type), transform, pos = rank(x,ties.method ="min")-1)
type x pos
1 a 1 0
2 a 77 2
3 a 1 0
4 b 34 2
5 b 1 0
6 b 8 1
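This works because the "min" rank of a value, minus 1, equals the number of values in the group that are strictly smaller than it. A minimal base R sketch of the same calculation without plyr, using ave(); the column name n_smaller is made up here to show the equivalence:

```r
df <- data.frame(type = c("a","a","a","b","b","b"), x = c(1, 77, 1, 34, 1, 8))

# Rank within each type; tied values share the lowest rank of the tie
df$pos <- ave(df$x, df$type, FUN = function(v) rank(v, ties.method = "min") - 1)

# Same result, spelled out as "how many values in my group are smaller than me"
df$n_smaller <- ave(df$x, df$type, FUN = function(v) {
  vapply(v, function(z) sum(v < z), numeric(1))
})
```

Both columns keep the original row order, so no sorting step is needed.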

Resources