Calculate difference between values in consecutive rows by group - r

This is my df (a data.frame):
group value
1 10
1 20
1 25
2 5
2 10
2 15
I need to calculate the difference between values in consecutive rows by group.
So, I need this result:
group value diff
1 10 NA # because there is no previous value
1 20 10 # value[2] - value[1]
1 25 5 # value[3] - value[2]
2 5 NA # because the group changed
2 10 5 # value[5] - value[4]
2 15 5 # value[6] - value[5]
I can handle this problem using ddply, but it takes too much time, because my df contains a lot of groups (over 1,000,000).
Are there any other effective approaches to handle this problem?

The package data.table can do this fairly quickly, using the shift function.
require(data.table)
df <- data.table(group = rep(c(1, 2), each = 3), value = c(10,20,25,5,10,15))
#setDT(df) #if df is already a data frame
df[ , diff := value - shift(value), by = group]
# group value diff
#1: 1 10 NA
#2: 1 20 10
#3: 1 25 5
#4: 2 5 NA
#5: 2 10 5
#6: 2 15 5
setDF(df) #if you want to convert back to old data.frame syntax
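For reference, shift() also takes n and type arguments, so a forward difference is just as easy (a sketch on the same df; diff_lead is a column name I made up):
df[, diff_lead := shift(value, type = "lead") - value, by = group]
# per group: next value minus current value; the last row of each group is NA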
Or use the lag function in dplyr:
library(dplyr)
df %>%
  group_by(group) %>%
  mutate(Diff = value - lag(value))
# group value Diff
# <int> <int> <int>
# 1 1 10 NA
# 2 1 20 10
# 3 1 25 5
# 4 2 5 NA
# 5 2 10 5
# 6 2 15 5
For alternatives pre-data.table::shift and pre-dplyr::lag, see edits.
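If you want to stay in base R and speed matters, a vectorized sketch (my addition, assuming rows are already ordered so that each group is contiguous, as in the example):
df$diff <- c(NA, diff(df$value))       # differences across the whole column
df$diff[!duplicated(df$group)] <- NA   # blank out the first row of each group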

You can use the base function ave() for this
df <- data.frame(group=rep(c(1,2),each=3),value=c(10,20,25,5,10,15))
df$diff <- ave(df$value, factor(df$group), FUN=function(x) c(NA,diff(x)))
which returns
group value diff
1 1 10 NA
2 1 20 10
3 1 25 5
4 2 5 NA
5 2 10 5
6 2 15 5

Try this with tapply (note this assumes the rows are sorted by group, since unlist(tapply(...)) returns the results in group order):
df$diff <- as.vector(unlist(tapply(df$value, df$group, FUN = function(x) c(NA, diff(x)))))

Since dplyr 1.1.0, you can shorten the dplyr version with inline, temporary grouping via .by (the result comes back ungrouped, so no ungroup() is needed):
mutate(df, diff = value - lag(value), .by = group)

Related


R - delete consecutive (ONLY) duplicates

I need to eliminate rows from a data frame based on the repetition of values in a given column, but only those that are consecutive.
For example, for the following data frame:
df = data.frame(x=c(1,1,1,2,2,4,2,2,1))
df$y <- c(10,11,30,12,49,13,12,49,30)
df$z <- c(1,2,3,4,5,6,7,8,9)
x y z
1 10 1
1 11 2
1 30 3
2 12 4
2 49 5
4 13 6
2 12 7
2 49 8
1 30 9
I would need to eliminate rows with consecutive repeated values in the x column, keep the last repeated row, and maintain the structure of the data frame:
x y z
1 30 3
2 49 5
4 13 6
2 49 8
1 30 9
Following directions from help and some other posts, I have tried using the duplicated function:
df[ !duplicated(x,fromLast=TRUE), ] # which gives me this:
x y z
1 1 10 1
6 4 13 6
7 2 12 7
9 1 30 9
NA NA NA NA
NA.1 NA NA NA
NA.2 NA NA NA
NA.3 NA NA NA
NA.4 NA NA NA
NA.5 NA NA NA
NA.6 NA NA NA
NA.7 NA NA NA
NA.8 NA NA NA
Not sure why I get the NA rows at the end (it wasn't happening with a similar table I was testing), and it only partially works on the values.
I have also tried using the data.table package as follows:
library(data.table)
dt <- as.data.table(df)
setkey(dt, x)
dt[J(unique(x)), mult ='last']
Works great, but it eliminates ALL duplicates from the data frame, not just those that are consecutive, giving something like this:
x y z
1 30 9
2 49 8
4 13 6
Please forgive me if this is cross-posting. I tried some of the suggestions, but none worked for eliminating only the consecutive duplicates.
I would appreciate any help.
Thanks
How about:
df[cumsum(rle(df$x)$lengths),]
Explanation:
rle(df$x)
gives you the run lengths and values of consecutive duplicates in the x variable. Then:
rle(df$x)$lengths
extracts the lengths. Finally:
cumsum(rle(df$x)$lengths)
gives the row indices which you can select using [.
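For concreteness, here is what each step returns on the example data (a sketch; the values follow directly from the x column above):
rle(df$x)
# Run Length Encoding
#   lengths: int [1:5] 3 2 1 2 1
#   values : num [1:5] 1 2 4 2 1
cumsum(rle(df$x)$lengths)
# [1] 3 5 6 8 9   <- row indices of the last row in each run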
EDIT: for fun, here's a microbenchmark of the answers given so far, with rle being mine, consec being what I think is the most fundamentally direct answer (given by @James, and the one I would "accept"), and dp being the dplyr answer given by @Nik.
#> Unit: microseconds
#> expr min lq mean median uq max
#> rle 134.389 145.4220 162.6967 154.4180 172.8370 375.109
#> consec 111.411 118.9235 136.1893 123.6285 145.5765 314.249
#> dp 20478.898 20968.8010 23536.1306 21167.1200 22360.8605 179301.213
rle performs better than I thought it would.
You just need to check that there is no duplicate following a number, i.e. x[i+1] != x[i], and note that the last value will always be kept.
df[c(df$x[-1] != df$x[-nrow(df)],TRUE),]
x y z
3 1 30 3
5 2 49 5
6 4 13 6
8 2 49 8
9 1 30 9
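To see why this works, the comparison vector for the sample x is (a sketch):
c(df$x[-1] != df$x[-nrow(df)], TRUE)
# [1] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
# TRUE marks the last row of each run, which is exactly what gets kept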
A cheap solution with dplyr that I could think of:
Method:
library(dplyr)
df %>%
  mutate(id = lag(x, 1),
         decision = if_else(x != id, 1, 0),
         final = lead(decision, 1, default = 1)) %>%
  filter(final == 1) %>%
  select(-id, -decision, -final)
Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 30 9
This will even work if your data ends with consecutive duplicate x values.
New Input:
df2 <- df %>% add_row(x = 1, y = 10, z = 12)
df2
x y z
1 1 10 1
2 1 11 2
3 1 30 3
4 2 12 4
5 2 49 5
6 4 13 6
7 2 12 7
8 2 49 8
9 1 30 9
10 1 10 12
Use same method:
df2 %>%
  mutate(id = lag(x, 1),
         decision = if_else(x != id, 1, 0),
         final = lead(decision, 1, default = 1)) %>%
  filter(final == 1) %>%
  select(-id, -decision, -final)
New Output:
x y z
1 1 30 3
2 2 49 5
3 4 13 6
4 2 49 8
5 1 10 12
Here is a data.table solution. The trick is to create a shifted version of x with the shift function and compare it with x:
library(data.table)
dattab <- as.data.table(df)
dattab[x != shift(x = x, n = 1, fill = -999, type = "lead")]
This way you compare each value of x with the value immediately following it and drop the rows where they match. Make sure to set fill to something that does not occur in x, so that the last value is handled correctly.
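If you'd rather not pick a sentinel value, an alternative sketch (my variant, not from the original answer) keeps the default fill = NA and guards the last row explicitly:
dattab[is.na(shift(x, type = "lead")) | x != shift(x, type = "lead")]
# keep a row if there is no next value, or if the next value differs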

Filter ids with having count > 1 in data.table [duplicate]

This question already has answers here: Select groups based on number of unique / distinct values (4 answers).
I would like to subset my data frame to keep only groups that have 3 or more observations on DIFFERENT days, and get rid of groups that have fewer than 3 observations or whose observations are not from 3 different days.
Here is a sample data set:
Group Day
1 1
1 3
1 5
1 5
2 2
2 2
2 4
2 4
3 1
3 2
3 3
4 1
4 5
So for the above example, group 1 and group 3 will be kept and group 2 and 4 will be removed from the data frame.
I hope this makes sense. I imagine the solution is quite simple, but I can't work it out (I'm quite new to R and not very fast at coming up with solutions to things like this). I thought the diff function might come in handy, but I didn't get much further.
With data.table you could do:
library(data.table)
DT <- as.data.table(df)  # assuming your data frame is called df
DT[, if(uniqueN(Day) >= 3) .SD, by = Group]
which gives:
Group Day
1: 1 1
2: 1 3
3: 1 5
4: 1 5
5: 3 1
6: 3 2
7: 3 3
Or with dplyr:
library(dplyr)
DT %>%
  group_by(Group) %>%
  filter(n_distinct(Day) >= 3)
which gives the same result.
One idea using dplyr
library(dplyr)
df %>%
  group_by(Group) %>%
  filter(length(unique(Day)) >= 3)
#Source: local data frame [7 x 2]
#Groups: Group [2]
# Group Day
# (int) (int)
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#5 3 1
#6 3 2
#7 3 3
We can use base R: table(df1) builds a Group x Day contingency table, and rowSums(table(df1) != 0) counts the distinct days per group.
i1 <- rowSums(table(df1) != 0) >= 3
subset(df1, Group %in% names(i1)[i1])
# Group Day
#1 1 1
#2 1 3
#3 1 5
#4 1 5
#9 3 1
#10 3 2
#11 3 3
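To see why this works, here are the intermediates on the sample data (a sketch; df1 reconstructs the Group/Day table above):
table(df1)
#      Day
# Group 1 2 3 4 5
#     1 1 0 1 0 2
#     2 0 2 0 2 0
#     3 1 1 1 0 0
#     4 1 0 0 0 1
rowSums(table(df1) != 0)
# 1 2 3 4
# 3 2 3 2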
Or a one-liner base R would be
df1[with(df1, as.logical(ave(Day, Group, FUN = function(x) length(unique(x)) >=3))),]

understanding apply and outer function in R

Suppose I have data which looks like this:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
I want to compare these values with each other, so that if an ID has changed its value of the A variable over the period of the B variable (which runs from 1 to 4), it goes into data frame K, and if it hasn't, it goes into data frame L.
So in this data set, K will look like:
ID A B C
1 X 1 10
1 X 2 10
1 Z 3 15
1 Y 4 12
2 Y 1 15
2 X 2 13
2 X 3 13
2 Y 4 13
and L will look like:
ID A B C
3 Y 1 16
3 Y 2 18
3 Y 3 19
3 Y 4 10
In terms of nested loops and if-then-else statements, it could be solved with pseudocode like the following:
for ( i in 1:length(ID)){
m=0
for (j in 1: length(B)){
ifelse( A[j] == A[j+1],m,m=m+1)
}
ifelse(m=0, L=c[,df[i]], K=c[,df[i]])
}
I have read in some posts that nested loops in R can be replaced by the apply and outer functions. Could someone help me understand how they can be used in such circumstances?
So basically you don't need a loop with conditions here. All you need to do is check whether there is any variance in A within each ID (converting it to a logical with !), and split accordingly. This converts A to a numeric value; I'm assuming it's a factor in your real data set (if it's not, you can use FUN = function(x) length(unique(x)) within ave instead). With base R we can use ave for such a task, for example:
indx <- !with(df, ave(as.numeric(A), ID, FUN = var))
Or (if A is a character rather than a factor):
indx <- with(df, ave(A, ID, FUN = function(x) length(unique(x)))) == 1L
Then simply run split
split(df, indx)
# $`FALSE`
# ID A B C
# 1 1 X 1 10
# 2 1 X 2 10
# 3 1 Z 3 15
# 4 1 Y 4 12
# 5 2 Y 1 15
# 6 2 X 2 13
# 7 2 X 3 13
# 8 2 Y 4 13
#
# $`TRUE`
# ID A B C
# 9 3 Y 1 16
# 10 3 Y 2 18
# 11 3 Y 3 19
# 12 3 Y 4 10
This will return a list with two data frames.
Similarly with data.table
library(data.table)
setDT(df)[, indx := !var(as.numeric(A)), by = ID]  # as.numeric needed if A is a factor
split(df, df$indx)
Or dplyr:
library(dplyr)
df %>%
  group_by(ID) %>%
  mutate(indx = !var(as.numeric(A))) %>%
  split(.$indx)
Since you want to understand apply rather than simply getting it done, you can consider tapply. As a demonstration:
> tapply(df$A, df$ID, function(x) ifelse(length(unique(x))>1, "K", "L"))
1 2 3
"K" "K" "L"
In a bit plainer English: go through all of df$A grouped by df$ID, and apply the function to df$A within each grouping (i.e., the x in the embedded function): if the number of unique values is more than 1, it's "K"; otherwise it's "L".
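To turn that per-ID label into the two data frames, a sketch (kl and the subsetting lines are my own glue, not part of the original answer):
kl <- tapply(df$A, df$ID, function(x) ifelse(length(unique(x)) > 1, "K", "L"))
K <- df[df$ID %in% names(kl)[kl == "K"], ]  # IDs whose A changed
L <- df[df$ID %in% names(kl)[kl == "L"], ]  # IDs whose A stayed constant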
We can do this using data.table. We convert the 'data.frame' to 'data.table' (setDT(df1)). Grouped by 'ID', we check whether the number of unique elements in 'A' (uniqueN(A)) is greater than 1, and create a column 'ind' based on that. We can then split the dataset on that 'ind' column.
library(data.table)
setDT(df1)[, ind:= uniqueN(A)>1, by = ID]
setDF(df1)
split(df1[-5], df1$ind)
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
Or similarly using dplyr, we can use n_distinct to create a logical column and then split by that column.
library(dplyr)
df2 <- df1 %>%
  group_by(ID) %>%
  mutate(ind = n_distinct(A) > 1)
split(df2, df2$ind)
Or a base R option with table. We take the table of the first two columns of 'df1', i.e. 'ID' and 'A'. Double negating (!!) the output converts nonzero frequencies to TRUE and zeros to FALSE, so the rowSums ('indx') count the distinct values of 'A' per 'ID'. We match the ID column in 'df1' with the names of 'indx', use that to map each 'ID' to TRUE/FALSE, and split the dataset with that.
indx <- rowSums(!!table(df1[1:2]))>1
lst <- split(df1, indx[match(df1$ID, names(indx))])
lst
#$`FALSE`
# ID A B C
#9 3 Y 1 16
#10 3 Y 2 18
#11 3 Y 3 19
#12 3 Y 4 10
#$`TRUE`
# ID A B C
#1 1 X 1 10
#2 1 X 2 10
#3 1 Z 3 15
#4 1 Y 4 12
#5 2 Y 1 15
#6 2 X 2 13
#7 2 X 3 13
#8 2 Y 4 13
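To make the double negation concrete, the intermediates look like this (a sketch; df1 is the ID/A/B/C data above):
table(df1[1:2])
#    A
# ID  X Y Z
#   1 2 1 1
#   2 2 2 0
#   3 0 4 0
rowSums(!!table(df1[1:2]))
# 1 2 3
# 3 2 1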
If we need the individual datasets in the global environment, change the names of the list elements to the object names we want and use list2env (not recommended, though):
list2env(setNames(lst, c('L', 'K')), envir = .GlobalEnv)

Replacing the last value within groups with different values

My question is similar to this post, but the difference is that instead of replacing the last value within each group/id with all 0's, different values are used to replace the last value within each group/id.
Here is an example (I borrowed it from the above link):
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 9999
6 2 0
7 2 9
8 2 500
9 3 0
10 3 1
In the above link, the last value within each group/id was replaced by a zero, using something like:
df %>%
  group_by(id) %>%
  mutate(Time = c(Time[-n()], 0))
And the output was
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 0
6 2 0
7 2 9
8 2 0
9 3 0
10 3 0
In my case, I would like the last value within each group/id to be replaced by a different value. Originally, the last value within each group/id was 9999, 500, and 1. Now I would like 9999 to be replaced by 5, 500 by 12, and 1 by 92. The desired output is:
id Time
1 1 3
2 1 10
3 1 1
4 1 0
5 1 5
6 2 0
7 2 9
8 2 12
9 3 0
10 3 92
I tried this:
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), c(5, 12, 92)))
but it did not work.
This can be solved with an almost identical solution to the one I posted in the linked question; just replace 0L with the desired values:
library(data.table)
indx <- setDT(df)[, .I[.N], by = id]$V1
df[indx, Time := c(5L, 12L, 92L)]
df
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
So, to add some explanations:
.I is identical to row_number() or 1:n() in dplyr for ungrouped data, e.g. 1:nrow(df) in base R.
.N is like n() in dplyr, i.e. the size of a certain group (or of the whole data set). So basically, when I run .I[.N] by group, I'm retrieving the global index of the last row of each group.
The next step is to just use this index as a row index within df while assigning the desired values to Time by reference, using the := operator.
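On the sample data, that index is simply the global row number of each group's last row (a sketch; the group sizes are 5, 3, and 2):
indx
# [1]  5  8 10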
Edit
Per the OP's request, here's a possible dplyr solution. Your original solution doesn't work because you are working per group, and thus you were trying to pass all three values to each group.
The only way I can think of is to first calculate the group sizes, then ungroup and mutate on the cumulative sum of these sizes (note this relies on each group having a distinct size; a more robust sketch follows the output), something along these lines:
library(dplyr)
df %>%
  group_by(id) %>%
  mutate(indx = n()) %>%
  ungroup() %>%
  mutate(Time = replace(Time, cumsum(unique(indx)), c(5, 12, 92))) %>%
  select(-indx)
# Source: local data frame [10 x 2]
#
# id Time
# 1 1 3
# 2 1 10
# 3 1 1
# 4 1 0
# 5 1 5
# 6 2 0
# 7 2 9
# 8 2 12
# 9 3 0
# 10 3 92
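A variant that doesn't depend on distinct group sizes (my sketch, not part of the original answer) looks the replacement up by id inside the grouped mutate:
repl <- c(`1` = 5, `2` = 12, `3` = 92)  # hypothetical lookup: replacement value per id
df %>%
  group_by(id) %>%
  mutate(Time = replace(Time, n(), repl[as.character(id[1])])) %>%
  ungroup()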
Another way using data.table would be to create another data.table which contains the values to be replaced with for a given id, and then join and update by reference (simultaneously).
require(data.table) # v1.9.5+ (for 'on = ' feature)
replace = data.table(id = 1:3, val = c(5L, 12L, 92L)) # from @David
setDT(df)[replace, Time := val, on = "id", mult = "last"]
# id Time
# 1: 1 3
# 2: 1 10
# 3: 1 1
# 4: 1 0
# 5: 1 5
# 6: 2 0
# 7: 2 9
# 8: 2 12
# 9: 3 0
# 10: 3 92
In data.table, joins are considered an extension of subsets. It's natural to think of doing whatever operation we do on subsets also on joins; both operations do something on some rows.
For each replace$id, we find the last matching row (mult = "last") in df$id, and update that row with the corresponding val.
Installation instructions for v1.9.5 here. Hope this helps.
