Count the length of number sequences in R

Here is sample data containing some arithmetic sequences, c(4, 5, 6) and c(10, 11):
df <- data.frame(x = c(2, 4, 5, 6, 8, 10, 11))
What I want is a new column that counts the length of each sequence so far, such as:
> df
x cnt
1 2 1
2 4 1
3 5 2
4 6 3
5 8 1
6 10 1
7 11 2
It would be simple to first assign df$cnt[1] = 1, then for the second row and beyond either increment the count or reset it to 1, depending on whether the consecutive numbers in df$x meet a certain criterion (here x[i] - x[i-1] == 1). I am just not sure a loop is the way to go in R, and I also need to deal with groups.
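For reference, here is roughly the loop I have in mind (just a sketch):
df$cnt <- NA_integer_
df$cnt[1] <- 1L
for (i in 2:nrow(df)) {
  # increment while x stays consecutive, otherwise reset to 1
  df$cnt[i] <- if (df$x[i] - df$x[i - 1] == 1) df$cnt[i - 1] + 1L else 1L
}
df$cnt
# [1] 1 1 2 3 1 1 2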
Alternatively, I can create a new column to check whether each value continues a sequence. From there, I can probably use rle to calculate the run lengths and generate the cnt column (though I am not sure how to handle the NA in the first row).
> df %>% mutate(check=(x-lag(x)==1))
x check
1 2 NA
2 4 FALSE
3 5 TRUE
4 6 TRUE
5 8 FALSE
6 10 FALSE
7 11 TRUE
Is this the way to go? Solutions with dplyr or data.table would be appreciated.

dplyr. Set a default value for lag() and your approach will work:
df %>%
  mutate(check = x - lag(x, default = x[1L]) != 1) %>%
  group_by(g = cumsum(check)) %>%
  mutate(cnt = row_number()) %>%
  ungroup() %>%
  select(-g, -check)
x cnt
<dbl> <int>
1 2 1
2 4 1
3 5 2
4 6 3
5 8 1
6 10 1
7 11 2
data.table. Along the same lines and more concisely:
library(data.table)
setDT(df)
df[, cnt := 1:.N, by=cumsum(x != shift(x, fill=x[1L]) + 1L)]
x cnt
1: 2 1
2: 4 1
3: 5 2
4: 6 3
5: 8 1
6: 10 1
7: 11 2
shift is data.table's analogue to lag.
Alternatively, from v1.9.7 of the package onward, you can use rowid instead:
df[, cnt := rowid(cumsum(x != shift(x, fill=x[1L]) + 1L))]
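To see what the grouping expression evaluates to for the sample data (a quick check on the same df after setDT()):
df[, .(x,
       shifted = shift(x, fill = x[1L]),
       grp = cumsum(x != shift(x, fill = x[1L]) + 1L))]
#     x shifted grp
# 1:  2       2   1
# 2:  4       2   2
# 3:  5       4   2
# 4:  6       5   2
# 5:  8       6   3
# 6: 10       8   4
# 7: 11      10   4
Each distinct grp value is then simply numbered 1:.N (or with rowid), which gives the cnt column shown above.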

Another option using base R:
unlist(sapply(rle(cumsum(ifelse(diff(c(df$x[1],df$x))!=1,1,0)))$lengths,seq_len))
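Broken into steps, this is what the one-liner does (the intermediate names d, brk, and grp are just for illustration):
d   <- diff(c(df$x[1], df$x))             # pad so the first difference is 0
brk <- ifelse(d != 1, 1, 0)               # 1 marks the start of a new run
grp <- cumsum(brk)                        # run id per row: 1 2 2 2 3 4 4
rle(grp)$lengths                          # length of each run
# [1] 1 3 1 2
unlist(sapply(rle(grp)$lengths, seq_len)) # expand each length into 1, 2, ..., length
# [1] 1 1 2 3 1 1 2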

Related

How to subset a data frame up until a point in R

I want to subset a data frame and take all observations for each id up until the first observation that doesn't meet my condition. Something like this:
goodDaysAfterTreatMent <- subset(Patientdays, treatmentDate < date & goodThings > badThings)
Except that this returns all observations that meet the condition. I want something that stops at the first observation that doesn't meet the condition, moves on to the next id, returns all observations for that id that meet the condition, and so on.
The only way I can see is to use a lot of loops, and loops are usually not a good thing in R.
Hope you have an idea.
Assume that your condition is to return rows where v < 5:
# example dataset
df = data.frame(id = c(1,1,1,1,2,2,2,2,3,3,3),
v = c(2,4,3,5,4,5,6,7,5,4,1))
df
# id v
# 1 1 2
# 2 1 4
# 3 1 3
# 4 1 5
# 5 2 4
# 6 2 5
# 7 2 6
# 8 2 7
# 9 3 5
# 10 3 4
# 11 3 1
library(tidyverse)
df %>%
  group_by(id) %>%                                 # for each id
  mutate(flag = cumsum(ifelse(v < 5, 1, NA))) %>%  # once v >= 5, the flag becomes NA for that row and all later rows
  filter(!is.na(flag)) %>%                         # keep only rows with a non-NA flag
  ungroup() %>%                                    # forget the grouping
  select(-flag)                                    # remove the helper column
# # A tibble: 4 x 2
# id v
# <dbl> <dbl>
# 1 1 2
# 2 1 4
# 3 1 3
# 4 2 4
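If your version of dplyr provides cumall() (a cumulative all()), the same idea can be written more directly; this is a sketch along the same lines, not part of the original answer:
df %>%
  group_by(id) %>%            # for each id
  filter(cumall(v < 5)) %>%   # keep rows only while the condition has held for every row so far in the group
  ungroup()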
Easy way:
Find the first FALSE with min(which(condition == FALSE)), then keep everything before it:
Patientdays<-cbind.data.frame(treatmentDate=c(1:5,4,6:10),date=c(2:5,3,6:10,10),goodThings=c(1:11),badThings=c(0:10))
attach(Patientdays)# Just due to ease of use (optional)
condition<-treatmentDate < date & goodThings > badThings
Patientdays[1:(min(which(condition == F))-1),]
Edit: Adding result.
treatmentDate date goodThings badThings
1 1 2 1 0
2 2 3 2 1
3 3 4 3 2
4 4 5 4 3

How to select unique points

I am a novice R programmer. I have the following series of points.
df <- data.frame(x = c(1 , 2, 3, 4), y = c(6 , 3, 7, 5))
df <- df %>% mutate(k = 1)
df <- df %>% full_join(df, by = 'k')
df <- subset(df, select = c('x.x', 'y.x', 'x.y', 'y.y'))
df
Is there a way to select only "unique" pairs of points? (The order of the points within a pair does not matter.)
EDIT:
x.x y.x x.y y.y
1 6 2 3
2 3 3 7
.
.
.
(I changed the 2 to 7 to clarify the problem)
With data.table (and working from the OP's initial df):
library(data.table)
setDT(df)
df[, r := .I ]
df[df, on=.(r > r), nomatch=0]
x y r i.x i.y
1: 2 3 1 1 6
2: 3 2 1 1 6
3: 4 5 1 1 6
4: 3 2 2 2 3
5: 4 5 2 2 3
6: 4 5 3 3 2
This is a "non-equi join" on row numbers. In x[i, on=.(r > r)] the left-hand r refers to the row in x and the right-hand one to a row of i. The columns named like i.* are taken from i.
Data.table joins, which are of the form x[i], use i to look up rows of x. The nomatch=0 option drops rows of i that find no matches.
In the tidyverse, you can save a bit of work by doing the self-join with tidyr::crossing. If you add row indices pre-join, reducing is a simple filter call:
library(tidyverse)
df %>%
  mutate(i = row_number()) %>%  # add row index column
  crossing(., .) %>%            # Cartesian self-join
  filter(i < i1) %>%            # reduce to lower indices
  select(-i, -i1)               # remove extraneous columns
## x y x1 y1
## 1 1 6 2 3
## 2 1 6 3 7
## 3 1 6 4 5
## 4 2 3 3 7
## 5 2 3 4 5
## 6 3 7 4 5
Or in all base R:
df$m <- 1
df$i <- seq(nrow(df))
df <- merge(df, df, by = 'm')
df[df$i.x < df$i.y, c(-1, -4, -7)]
## x.x y.x x.y y.y
## 2 1 6 2 3
## 3 1 6 3 7
## 4 1 6 4 5
## 7 2 3 3 7
## 8 2 3 4 5
## 12 3 7 4 5
You can use the duplicated.matrix() function from base R to find the rows which are not duplicates, which means they are in fact unique. When you call the duplicated() function, you have to clarify that you only want to use the first two columns. With this call you check which rows are unique. In a second step, you index your data frame with these rows, keeping all columns.
unique_lines = !duplicated.matrix(df[,c(1,2)])
df[unique_lines,]

Conditional cumsum with reset

I have a data frame that is already sorted as needed, and now I would like to "slice" it into groups.
Each group should have a maximum cumulative value of 10. When the cumulative value would exceed 10, the cumulative sum should reset and start over again.
library(dplyr)
id <- sample(1:15)
order <- 1:15
value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
df <- data.frame(id, order, value)
df
This is the output I'm looking for (I did it "manually"):
cumsum_10 <- c(4, 9, 7, 10, 8, 9, 2, 7, 10, 6, 8, 6, 9, 10, 4)
group_10 <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7)
df1 <- data.frame(df, cumsum_10, group_10)
df1
So I have two problems:
How to create a cumulative variable that resets every time it passes an upper limit (10 in this case)
How to count/group every group
For the first part I was trying some combinations of group_by and cumsum, with no luck:
df1 <- df %>% group_by(cumsum(c(False, value < 10)))
I would prefer a pipe (%>%) solution instead of a for loop
Thanks
I think this is not easily vectorizable... at least I do not know how.
You can do it by hand via:
my_cumsum <- function(x){
  grp <- integer(length(x))
  grp[1] <- 1
  for (i in 2:length(x)) {
    if (x[i-1] + x[i] <= 10) {
      grp[i] <- grp[i-1]
      x[i] <- x[i-1] + x[i]
    } else {
      grp[i] <- grp[i-1] + 1
    }
  }
  data.frame(grp, x)
}
For your data this gives:
> my_cumsum(df$value)
grp x
1 1 4
2 1 9
3 2 7
4 2 10
5 3 8
6 3 9
7 4 2
8 4 7
9 4 10
10 5 6
11 5 8
12 6 6
13 6 9
14 6 10
15 7 4
Also for my "counter-example" this gives:
> my_cumsum(c(10,6,4))
grp x
1 1 10
2 2 6
3 2 10
As #Khashaa pointed out, this can be implemented more efficiently via Rcpp. He linked to this answer, How to speed up or vectorize a for loop?, which I find very useful.
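For completeness, a rough sketch of that Rcpp idea (the function name and interface here are made up, not taken from the linked answer):
library(Rcpp)
cppFunction('
IntegerVector group_cumsum(NumericVector x, double threshold) {
  int n = x.size();
  IntegerVector grp(n);
  double running = 0.0;
  int g = 1;
  for (int i = 0; i < n; ++i) {
    running += x[i];
    if (running > threshold) {   // the running sum passed the limit: reset and open a new group
      running = x[i];
      ++g;
    }
    grp[i] = g;
  }
  return grp;
}')
group_cumsum(df$value, 10)
# [1] 1 1 2 2 3 3 4 4 4 5 5 6 6 6 7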
You could define your own function and then use it inside dplyr's mutate statement as follows:
df %>%
  group_by() %>%
  mutate(
    cumsum_10 = cumsum_with_reset(value, 10),
    group_10 = cumsum_with_reset_group(value, 10)
  ) %>%
  ungroup()
The cumsum_with_reset() function takes a column and a threshold value which resets the sum. cumsum_with_reset_group() is similar but identifies rows that have been grouped together. Definitions are as follows:
# group rows based on cumsum with reset
cumsum_with_reset_group <- function(x, threshold) {
  cumsum <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(x)) {
    cumsum <- cumsum + x[i]
    if (cumsum > threshold) {
      group <- group + 1
      cumsum <- x[i]
    }
    result <- c(result, group)
  }
  return(result)
}

# cumsum with reset
cumsum_with_reset <- function(x, threshold) {
  cumsum <- 0
  group <- 1
  result <- numeric()
  for (i in 1:length(x)) {
    cumsum <- cumsum + x[i]
    if (cumsum > threshold) {
      group <- group + 1
      cumsum <- x[i]
    }
    result <- c(result, cumsum)
  }
  return(result)
}
This can be done easily with purrr::accumulate
library(dplyr)
library(purrr)
df %>%
  mutate(cumsum_10 = accumulate(value, ~ ifelse(.x + .y <= 10, .x + .y, .y)),
         group_10 = cumsum(value == cumsum_10))
id order value cumsum_10 group_10
1 8 1 4 4 1
2 13 2 5 9 1
3 7 3 7 7 2
4 1 4 3 10 2
5 4 5 8 8 3
6 10 6 1 9 3
7 12 7 2 2 4
8 2 8 5 7 4
9 15 9 3 10 4
10 11 10 6 6 5
11 14 11 2 8 5
12 3 12 6 6 6
13 5 13 3 9 6
14 9 14 1 10 6
15 6 15 4 4 7
We can take advantage of the cumsumbinning function from the MESS package, which performs exactly this task:
library(MESS)
df %>%
  group_by(group_10 = cumsumbinning(value, 10)) %>%
  mutate(cumsum_10 = cumsum(value))
Output
# A tibble: 15 x 5
# Groups: group_10 [7]
id order value group_10 cumsum_10
<int> <int> <dbl> <int> <dbl>
1 6 1 4 1 4
2 10 2 5 1 9
3 1 3 7 2 7
4 5 4 3 2 10
5 3 5 8 3 8
6 9 6 1 3 9
7 14 7 2 4 2
8 11 8 5 4 7
9 15 9 3 4 10
10 8 10 6 5 6
11 12 11 2 5 8
12 2 12 6 6 6
13 4 13 3 6 9
14 7 14 1 6 10
15 13 15 4 7 4
The function below uses recursion to construct a vector with the lengths of each group. It is faster than a loop for small data vectors (length less than about a hundred values), but slower for longer ones. It takes three arguments:
1) vec: A vector of values that we want to group.
2) i: The index of the starting position in vec.
3) glv: A vector of group lengths. This is the return value, but we need to initialize it and pass it along through each recursion.
# Group a vector based on consecutive values with a cumulative sum <= 10
gf = function(vec, i, glv) {
  ## Break out of the recursion when we get to the last group
  if (sum(vec[i:length(vec)]) <= 10) {
    glv = c(glv, length(i:length(vec)))
    return(glv)
  }
  ## Keep recursion going if there are at least two groups left
  # Calculate length of current group
  gl = sum(cumsum(vec[i:length(vec)]) <= 10)
  # Append to previous group lengths
  glv.append = c(glv, gl)
  # Call function recursively
  gf(vec, i + gl, glv.append)
}
Run the function to return a vector of group lengths:
group_vec = gf(df$value, 1, numeric(0))
[1] 2 2 2 3 2 3 1
To add a column to df with the group lengths, use rep:
df$group10 = rep(1:length(group_vec), group_vec)
In its current form the function will only work on vectors that don't have any values greater than 10, and the grouping by sums <= 10 is hard-coded. The function can of course be generalized to deal with these limitations.
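For instance, a sketch of that generalization (gf2 is just an illustrative name, and it still assumes no single value exceeds the target):
gf2 = function(vec, i, glv, target) {
  # Break out of the recursion when the remaining values form the last group
  if (sum(vec[i:length(vec)]) <= target) {
    return(c(glv, length(i:length(vec))))
  }
  # Length of the current group, then recurse on the rest of the vector
  gl = sum(cumsum(vec[i:length(vec)]) <= target)
  gf2(vec, i + gl, c(glv, gl), target)
}
gf2(df$value, 1, numeric(0), 10)
# [1] 2 2 2 3 2 3 1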
The function can be sped up somewhat by doing cumulative sums that look ahead only a certain number of values, rather than the remaining length of the vector. For example, if the values are always positive, you only need to look ten values ahead, since you'll never need to sum more than ten numbers to reach a value of 10. This too can be generalized for any target value. Even with this modification, the function is still slower than a loop for a vector with more than about a hundred values.
I haven't worked with recursive functions in R before and would be interested in any comments and suggestions on whether recursion makes sense for this type of problem and whether it can be improved, especially execution speed.

Replacing the last value within groups with different values

My question is similar to this post, but the difference is that instead of replacing the last value within each group/id with zeros, different values are used to replace the last value within each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
group_by(id) %>%
mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last values within each group/id were 9999, 500, and 1. Now I would like 9999 to be replaced by 5, 500 by 12, and 1 by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this one:
df %>%
group_by(id) %>%
mutate(Time = replace(Time, n(), c(5,12,92))),
but it did not work.
This could be solved using an almost identical solution to the one I posted in the linked question; e.g., just replace 0L with the desired values:
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So to add some explanations:
.I is identical to row_number() or 1:n() in dplyr for ungrouped data, i.e. 1:nrow(df) in base R.
.N is like n() in dplyr, i.e. the size of a certain group (or of the whole data set). So basically when I run .I[.N] by group, I'm retrieving the global index of the last row of each group.
The next step is just to use this index as a row index into df while assigning the desired values to Time by reference using the := operator.
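For the example data, .I[.N] by id picks out the global row numbers of the last row of each group (a quick check):
setDT(df)[, .I[.N], by = id]
#    id V1
# 1:  1  5
# 2:  2  8
# 3:  3 10
so indx here is c(5, 8, 10).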
Edit
Per the OP's request, here's a possible dplyr solution. Your original attempt doesn't work because you are working per group, and thus you were trying to pass all three replacement values to each group.
The only way I can think of is to first calculate the group sizes, then ungroup and mutate using the cumulative sum of those sizes as row locations, something along these lines:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(indx = n()) %>%
  ungroup() %>%
  mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
  select(-indx)
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
Another way using data.table would be to create another data.table which contains the replacement value for each id, and then join and update by reference (simultaneously).
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace = data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
In data.table, joins are considered as an extension of subsets. It's natural to think of doing whatever operation we do on subsets also on joins. Both operations do something on some rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
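For contrast, a sketch (not from the original answer): dropping mult = "last" falls back to the default mult = "all", so the update join would overwrite Time for every matching row of each id rather than only the last one:
setDT(df)[replace, Time := val, on = "id"]   # every row of each id would get its val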
Installation instructions for v1.9.5 here. Hope this helps.

R: subset/group data frame with a max value?

Given a data frame like this:
gid set a b
1 1 1 1 9
2 1 2 -2 -3
3 1 3 5 6
4 2 2 -4 -7
5 2 6 5 10
6 2 9 2 0
How can I subset/group the data frame to one row per unique gid with the max set value, plus a 1/0 flag for whether its a value is greater than its b value?
So here, it'd be, uh...
1,3,0
2,9,1
Kind of a stupid simple thing in SQL but I'd like to have a bit better control over my R, so...
Piece of cake with dplyr:
dat <- read.table(text="gid set a b
1 1 1 9
1 2 -2 -3
1 3 5 6
2 2 -4 -7
2 6 5 10
2 9 2 0", header=TRUE)
library(dplyr)
dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater = a > b) %>%
  select(gid, set, greater)
## Source: local data frame [2 x 3]
## Groups: gid
##
## gid set greater
## 1 1 3 FALSE
## 2 2 9 TRUE
If you really need 1's and 0's and the dplyr groups cause any angst:
dat %>%
  group_by(gid) %>%
  filter(row_number() == which.max(set)) %>%
  mutate(greater = ifelse(a > b, 1, 0)) %>%
  select(gid, set, greater) %>%
  ungroup()
## Source: local data frame [2 x 3]
##
## gid set greater
## 1 1 3 0
## 2 2 9 1
You could do the same thing without pipes:
ungroup(
  select(
    mutate(
      filter(
        group_by(dat, gid),
        row_number() == which.max(set)),
      greater = ifelse(a > b, 1, 0)),
    gid, set, greater))
but…but… why?! :-)
Here's a data.table possibility, assuming your original data is called df.
library(data.table)
setDT(df)[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
# gid set b
# 1: 1 3 0
# 2: 2 9 1
Note that to account for multiple max(set) rows, I used set == max(set) as the subset, so that this will return as many rows as there are ties for the max (if that makes any sense at all).
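For instance, a quick check with a tie added (dt2 and its values are made up purely for illustration):
dt2 <- data.table(gid = c(1, 1, 1), set = c(1, 3, 3), a = c(1, -2, 5), b = c(9, -3, 6))
dt2[, .(set = max(set), b = as.integer(a > b)[set == max(set)]), gid]
#    gid set b
# 1:   1   3 1
# 2:   1   3 0
Both rows tied for the maximum set are kept.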
And courtesy of @thelatemail, another data.table option:
setDT(df)[, list(set = max(set), ab = (a > b)[which.max(set)] + 0), by = gid]
# gid set ab
# 1: 1 3 0
# 2: 2 9 1
In base R, you can use ave:
indx <- with(df, ave(set, gid, FUN=max)==set)
#in cases of ties
#indx <- with(df, !!ave(set, gid, FUN=function(x)
# which.max(x) ==seq_along(x)))
transform(df[indx,], greater=(a>b)+0)[,c(1:2,5)]
# gid set greater
# 3 1 3 0
# 6 2 9 1
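To see what the index does here (a quick check, assuming df holds the same data as dat above):
with(df, ave(set, gid, FUN = max))
# [1] 3 3 3 9 9 9
indx
# [1] FALSE FALSE  TRUE FALSE FALSE  TRUE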
