Calculate ranks for each group - r

I have a df with types and values. Within each type, I want to rank the rows by x and add a column pos that counts how many other rows of the same type have a lower x than the current row.
e.g.
df <- data.frame(type = c("a","a","a","b","b","b"),x=c(1,77,1,34,1,8))
# for type a, row 2 has a higher x than rows 1 and 3, so it has a pos value of 2
I can do this with:
library(plyr)
df <- data.frame(type = c("a","a","a","b","b","b"),x=c(1,77,1,34,1,8))
df <- ddply(df,.(type), function(x) x[with(x, order(x)) ,])
df <- ddply(df,.(type), transform, pos = (seq_along(x)-1) )
type x pos
1 a 1 0
2 a 1 1
3 a 77 2
4 b 1 0
5 b 8 1
6 b 34 2
But this approach does not take into account the tie between rows 1 and 2 of type a. What's the easiest way to get output where ties have the same value, e.g.
type x pos
1 a 1 0
2 a 1 0
3 a 77 2
4 b 1 0
5 b 8 1
6 b 34 2

ddply(df,.(type), transform, pos = rank(x,ties.method ="min")-1)
type x pos
1 a 1 0
2 a 77 2
3 a 1 0
4 b 34 2
5 b 1 0
6 b 8 1
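For reference, the same zero-based, tie-aware position can be sketched in base R without plyr. This is only an illustration built on the example data from the question; ave() applies the ranking function within each type and leaves the rows in their original order.

```r
# Example data from the question
df <- data.frame(type = c("a", "a", "a", "b", "b", "b"),
                 x = c(1, 77, 1, 34, 1, 8))

# ties.method = "min" gives every tied value the lowest rank of the tie,
# so rank - 1 is the number of rows in the group with a strictly smaller x
df$pos <- ave(df$x, df$type,
              FUN = function(v) rank(v, ties.method = "min") - 1)
df
```

The two tied rows of type a both get pos 0, and the row with x = 77 gets pos 2.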

Related

Assign sequential group ID given a group start indicator

I need to assign subgroup IDs given a group ID and an indicator showing the beginning of the new subgroup. Here's a test dataset:
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group=group, indic=c(x1, x2))
Here is the resulting data frame:
df
group indic
1 A 0
2 A 0
3 A 0
4 A 1
5 A 1
6 A 1
7 A 0
8 A 0
9 B 0
10 B 1
11 B 0
12 B 1
13 B 0
14 B 1
15 B 0
16 B 1
indic==1 means that row is the beginning of a new subgroup, and the subgroup should be numbered 1 higher than the previous subgroup. Where indic==0 the subgroup should be the same as the previous subgroup. The subgroup numbering starts at 1. When the group variable changes, the subgroup numbering resets to 1. I would like to use the tidyverse framework.
Here is the result that I want:
df
group indic subgroup
1 A 0 1
2 A 0 1
3 A 0 1
4 A 1 2
5 A 1 3
6 A 1 4
7 A 0 4
8 A 0 4
9 B 0 1
10 B 1 2
11 B 0 2
12 B 1 3
13 B 0 3
14 B 1 4
15 B 0 4
16 B 1 5
I would like to be able to give some methods that I've tried already but didn't work, but I haven't been able to find anything even close. Any help will be appreciated.
You can just use
library(dplyr)
df %>% group_by(group) %>%
mutate(subgroup=cumsum(indic)+1)
# group indic subgroup
# <fct> <dbl> <dbl>
# 1 A 0 1
# 2 A 0 1
# 3 A 0 1
# 4 A 1 2
# 5 A 1 3
# 6 A 1 4
# 7 A 0 4
# 8 A 0 4
# 9 B 0 1
# 10 B 1 2
# 11 B 0 2
# 12 B 1 3
# 13 B 0 3
# 14 B 1 4
# 15 B 0 4
# 16 B 1 5
We use dplyr to do the grouping and then cumsum, which takes the cumulative sum of the indic column, so the subgroup number increases by one each time it sees a 1.
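If you would rather avoid the dplyr dependency, the same idea can be sketched in base R. This is a minimal illustration using the test dataset from the question; ave() restarts the cumulative sum for each group.

```r
# Test dataset from the question
group <- c(rep("A", 8), rep("B", 8))
x1 <- c(rep(0, 3), rep(1, 3), rep(0, 2))
x2 <- rep(0:1, 4)
df <- data.frame(group = group, indic = c(x1, x2))

# Cumulative sum of the start indicator within each group;
# adding 1 makes the numbering start at 1 instead of 0
df$subgroup <- ave(df$indic, df$group, FUN = cumsum) + 1
df
```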

Create equal length vectors from time series based upon factor in R

I have a data frame that is something like this:
time type count
1 -2 a 1
2 -1 a 4
3 0 a 6
4 1 a 2
5 2 a 5
6 0 b 3
7 1 b 7
8 2 b 2
I want to create a new data frame that takes type 'b' and creates the full time series by filling in zeroes for count. It should look like this:
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
I can certainly subset(df, type == 'b') and then hack the beginning and rbind, but I want it to be more dynamic in case the time vector changes.
We can use complete from tidyr to get the full 'time' for all the unique values of 'type' and filter the value of interest in 'type'.
library(tidyr)
library(dplyr)
val <- "b"
df1 %>%
complete(time, type, fill=list(count=0)) %>%
filter(type== val)
# time type count
# <int> <chr> <dbl>
#1 -2 b 0
#2 -1 b 0
#3 0 b 3
#4 1 b 7
#5 2 b 2
With base R:
df1 <- data.frame(time=df[df$type == 'a',]$time, type='b', count=0)
df1[match(df[df$type=='b',]$time, df1$time),]$count <- df[df$type=='b',]$count
df1
time type count
1 -2 b 0
2 -1 b 0
3 0 b 3
4 1 b 7
5 2 b 2
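Note that the base R version above takes its time grid from the rows of type 'a', so it relies on 'a' covering the full time range. A slightly more general sketch, assuming the full series is simply every distinct time value in the data, could look like this (the names val, all_times, sub and out are made up for the illustration):

```r
# Data from the question
df <- data.frame(time = c(-2, -1, 0, 1, 2, 0, 1, 2),
                 type = c("a", "a", "a", "a", "a", "b", "b", "b"),
                 count = c(1, 4, 6, 2, 5, 3, 7, 2))

val <- "b"                           # type of interest
all_times <- sort(unique(df$time))   # assumed definition of the full series
sub <- df[df$type == val, ]

# Start from an all-zero series, then overwrite the times that were observed
out <- data.frame(time = all_times, type = val, count = 0)
out$count[match(sub$time, out$time)] <- sub$count
out
```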

Creating new dataframe with missing value

I have a data frame structured like this:
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
In order to plot the data, I need it to contain a value for each group (a-d) at each time interval, even if that value is zero. So a data frame looking like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
Any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new
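For completeness, here is the same technique spelled out as a self-contained sketch, with the grid of combinations built under explicit column names (the variable name grid is just for illustration):

```r
# Data from the question
time <- c(1, 1, 1, 1, 2, 2)
group <- c("a", "b", "c", "d", "c", "d")
number <- c(2, 3, 4, 1, 2, 12)
df <- data.frame(time, group, number)

# Every time/group combination that should appear in the result
grid <- expand.grid(time = unique(df$time), group = unique(df$group),
                    stringsAsFactors = FALSE)

# Keep all rows of the grid; combinations missing from df get NA, then 0
new <- merge(df, grid, all.y = TRUE)
new$number[is.na(new$number)] <- 0
new
```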

Cumulative Sum Starting at Center of Data Frame - R

I have this data.frame called dum
dummy <- data.frame(label = "a", x = c(1,1,1,1,0,1,1,1,1,1,1,1,1))
dummy1 <- data.frame(label = "b", x = c(1,1,1,1,1,1,1,1,0,1,1,1,1))
dum <- rbind(dummy,dummy1)
What I am trying to do is take the cumulative sum starting at 0 in the x column of dum. The summing would be grouped by the label column, which can be implemented in dplyr or plyr. The part that I am struggling with is how to start the cumulative sum from the 0 position in x and go outward.
The resulting data.frame should look like this :
>dum
label x output
1 a 1 4
2 a 1 3
3 a 1 2
4 a 1 1
5 a 0 0
6 a 1 1
7 a 1 2
8 a 1 3
9 a 1 4
10 a 1 5
11 a 1 6
12 a 1 7
13 a 1 8
14 b 1 8
15 b 1 7
16 b 1 6
17 b 1 5
18 b 1 4
19 b 1 3
20 b 1 2
21 b 1 1
22 b 0 0
23 b 1 1
24 b 1 2
25 b 1 3
26 b 1 4
This would need to be iterated thousands of times over millions of rows of data.
As usual, thanks for any and all help
It seems more like you just want to find the distance to a zero, rather than any sort of cumulative sum. If that's the case, then
#find zeros for each group
zeros <- tapply(seq.int(nrow(dum)) * as.numeric(dum$x==0), dum$label, max)
#calculate distance from zero for each point
dist <- abs(zeros[dum$label]-seq.int(nrow(dum)))
And that gives
cbind(dum, dist)
# label x dist
# 1 a 1 4
# 2 a 1 3
# 3 a 1 2
# 4 a 1 1
# 5 a 0 0
# 6 a 1 1
# 7 a 1 2
# 8 a 1 3
# 9 a 1 4
# 10 a 1 5
# 11 a 1 6
# 12 a 1 7
# 13 a 1 8
# 14 b 1 8
# 15 b 1 7
# 16 b 1 6
# 17 b 1 5
# 18 b 1 4
# 19 b 1 3
# 20 b 1 2
# 21 b 1 1
# 22 b 0 0
# 23 b 1 1
# 24 b 1 2
# 25 b 1 3
# 26 b 1 4
Or even ave will let you do it in one step
dist <- with(dum, ave(x,label,FUN=function(x) abs(seq_along(x)-which.min(x))))
cbind(dum, dist)
You can do this with by but also with plyr, data.table, etc. The function that is used on each subset is
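f <- function(d) {
  x <- d$x
  # position of the first zero in this group
  i <- match(0, x)
  # cumulative sum running backwards from the zero to the start
  v1 <- rev(cumsum(rev(x[1:i])))
  # cumulative sum running forwards from just after the zero
  # (negative indexing stays correct even when the zero is the last element)
  v2 <- cumsum(x[-seq_len(i)])
  transform(d, output = c(v1, v2))
}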
To call it on each subset e.g. with by
res <- by(dum, list(dum$label), f)
do.call(rbind, res)
If you want to use ddply
library(plyr)
ddply(dum, .(label), f)
May be faster with data.table
library(data.table)
dumdt <- as.data.table(dum)
setkey(dumdt, label)
dumdt[, f(.SD), by = key(dumdt)]
Using dplyr
library(dplyr)
dum%>%
group_by(label)%>%
mutate(dist=abs(row_number()-which.min(x)))
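One caveat when choosing between the answers above: the distance-to-zero versions and the cumulative-sum version agree on the example only because every non-zero x is 1. A small sketch with made-up data (dum2 and outward_cumsum are hypothetical names for this illustration) shows where they diverge:

```r
# Made-up data with non-unit values around a single zero
dum2 <- data.frame(label = "a", x = c(2, 3, 0, 1, 4))

# Distance (in rows) to the zero, as in the ave() answer
dist <- with(dum2, ave(x, label,
                       FUN = function(v) abs(seq_along(v) - which.min(v))))

# Cumulative sum going outward from the zero, as in the by()/ddply answer
outward_cumsum <- function(v) {
  i <- match(0, v)
  c(rev(cumsum(rev(v[seq_len(i)]))), cumsum(v[-seq_len(i)]))
}
csum <- with(dum2, ave(x, label, FUN = outward_cumsum))

cbind(dum2, dist, csum)
```

Here dist is 2, 1, 0, 1, 2 while csum is 5, 3, 0, 1, 5, so pick the version that matches what you actually need. Also note that which.min(x) only finds the zero if no value in the group is negative.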

Subsequent row summing in dataframe object

I would like to compute a running sum of one column's values within each group defined by another column, and put the result into a new column without deleting any rows.
Below is some R code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do this, since the for loop will be time consuming on my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The purpose of this is so that I can write code that calculates Y2 in MyDf.2 above for each ID, separately.
This is what I came up with, and it does the trick (calculating a TEST column in MyDf that has to be equal to Y2 in MyDf.2):
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
> MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
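If the data had only the FIRST flag and no usable ID column, a similar sketch could derive the grouping from the flag itself. This assumes FIRST is 1 exactly on the first row of each block and 0 elsewhere, as in the example; run_id is a made-up name.

```r
# Example data from the question
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6)
MyDf$FIRST <- c(1, 0, 0, 1, 0, 0)

# cumsum of the start flag gives a block number that increases at each FIRST == 1
run_id <- cumsum(MyDf$FIRST)

# Running sum of Y that restarts at the beginning of each block
MyDf$TEST <- ave(MyDf$Y, run_id, FUN = cumsum)
MyDf
```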
