How to determine when a change in value occurs in R

I am following this example from Stack Overflow: Identifying where value changes in R data.frame column
There are two columns: ind and value. How do I identify the 'ind' when 'value' increases by 100?
For example,
Value increases by 100 at ind = 4.
df <- data.frame(ind=1:10,
value=as.character(c(100,100,100,200,200,200,300,300,400,400)), stringsAsFactors=F)
df
ind value
1 1 100
2 2 100
3 3 100
4 4 200
5 5 200
6 6 200
7 7 300
8 8 300
9 9 400
10 10 400
I tried this but it doesn't work:
miss <- function(x) ifelse(is.finite(x),x,NA)
value_xx =miss(min(df$ind[df$value[1:length(df$value)] >= 100], Inf, na.rm=TRUE))

Like this:
df$ind[c(FALSE, diff(as.numeric(df$value)) == 100)]

You can use diff to get the difference between consecutive values and take the indices where that difference is greater than or equal to 100. Because value is stored as character, convert it with as.numeric first. Add 1 to the index, since diff returns a vector that is one element shorter than the original.
df$ind[which(diff(as.numeric(df$value)) >= 100) + 1]
#[1] 4 7 9
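As a rough illustration of why the index is shifted by one, the intermediate results can be inspected step by step. This is only a sketch on the question's data; the names v, d and jumps are introduced here for illustration.

```r
# Same data as in the question; value is stored as character
df <- data.frame(ind = 1:10,
                 value = as.character(c(100,100,100,200,200,200,300,300,400,400)),
                 stringsAsFactors = FALSE)

v <- as.numeric(df$value)   # diff() needs numeric input
d <- diff(v)                # d[i] is v[i + 1] - v[i], so d has one element fewer than v
d
# [1]   0   0 100   0   0 100   0 100   0

jumps <- which(d >= 100) + 1  # shift by one to point at the row after the jump
df$ind[jumps]
# [1] 4 7 9
```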
In dplyr, you can use lag to get previous values :
library(dplyr)
df %>% filter(as.numeric(value) - lag(as.numeric(value)) >= 100)
# ind value
#1 4 200
#2 7 300
#3 9 400

Related

rolling function with variable width R

I need to summarize some data using a rolling window of variable width and shift. In particular, I need to apply a function (e.g. sum) over values recorded at different intervals.
Here is an example of a data frame:
df <- tibble(days = c(0,1,2,3,1),
value = c(5,7,3,4,2))
df
# A tibble: 5 x 2
days value
<dbl> <dbl>
1 0 5
2 1 7
3 2 3
4 3 4
5 1 2
The columns indicate:
days how many days elapsed from the previous observation. The first value is 0 because no previous observation.
value the value I need to aggregate.
Now, let's assume that I need to sum the field value over every 4 days, shifting 1 day at a time.
I need something along these lines:
days value roll_sum rows_to_sum
0 5 15 1,2,3
1 7 10 2,3
2 3 3 3
3 4 6 4,5
1 2 NA NA
The column rows_to_sum has been added to make it clear.
Here more details:
The first value (15) is the sum of the first 3 rows, because 0+1+2 = 3, which is less than the reference value 4, and adding the next row (with days = 3) would bring the total day count to 6, which is more than 4.
The second value (10) is the sum of rows 2 and 3. Excluding the first row (since we are shifting one day), we only sum rows 2 and 3, because including row 4 would bring the total day count to 1+2+3 = 6, which is more than 4.
...
How can I achieve this?
Thank you
Here is one way :
library(dplyr)
library(purrr)
df %>%
mutate(roll_sum = map_dbl(row_number(), ~{
i <- which(cumsum(days[.x:n()]) <= 4)
# max() of an empty vector is -Inf, not NA, so test for "no rows fit" explicitly
if(length(i) == 0) NA_real_ else sum(value[.x:(.x + max(i) - 1)])
}))
# days value roll_sum
# <dbl> <dbl> <dbl>
#1 0 5 15
#2 1 7 10
#3 2 3 3
#4 3 4 6
#5 1 2 2
Performing this calculation in base R :
sapply(seq(nrow(df)), function(x) {
i <- which(cumsum(df$days[x:nrow(df)]) <= 4)
# max() of an empty vector is -Inf, not NA, so test for "no rows fit" explicitly
if(length(i) == 0) NA else sum(df$value[x:(x + max(i) - 1)])
})
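The same window logic can be wrapped in a small function so that the window width is a parameter. This is a minimal sketch, not the answerer's code: roll_sum_days and its width argument are hypothetical names, and it assumes days is non-negative.

```r
# Hypothetical helper: roll_sum_days() and `width` are illustrative names
roll_sum_days <- function(days, value, width = 4) {
  n <- length(days)
  vapply(seq_len(n), function(start) {
    # running day count from the current row to the end
    elapsed <- cumsum(days[start:n])
    # rows (relative to `start`) whose running count still fits in the window
    fits <- which(elapsed <= width)
    if (length(fits) == 0) return(NA_real_)   # nothing fits: no sum for this row
    sum(value[start:(start + max(fits) - 1)])
  }, numeric(1))
}

roll_sum_days(c(0, 1, 2, 3, 1), c(5, 7, 3, 4, 2))
# [1] 15 10  3  6  2
```

With a narrower window, a row whose own days value exceeds the width gets NA, for example row 4 when width = 2.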

assign values from same row in R

I want to assign the row value of B to row A only when A = 1. This is what I have done so far:
Data frame:
df <- data.frame('A' = c(1,1,2,5,4,3,1,2), 'B' = c(100,200,200,200,100,200,100,200))
A B
1 1 100
2 1 200
3 2 200
4 5 200
5 4 100
6 3 200
7 1 100
8 2 200
Output:
df$A[df$A == 1] <- df$B
A B
1 100 100
2 200 200
3 2 200
4 5 200
5 4 100
6 3 200
7 200 100
8 2 200
As you can see, rows 1 and 2 do what they are supposed to do. However, row 7 doesn't, but instead takes the value from row 3 - it is assigning values sequentially.
My question: how do I assign values taken from the same row?
Use:
df$A[df$A == 1] <- df$B[df$A == 1]
You need to apply the same index to both, column to be replaced and column that holds the replacements.
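One way to make the shared index explicit is to store the logical mask in a variable first. A small sketch on the question's data; the name is_one is introduced here for illustration.

```r
df <- data.frame(A = c(1, 1, 2, 5, 4, 3, 1, 2),
                 B = c(100, 200, 200, 200, 100, 200, 100, 200))

is_one <- df$A == 1           # compute the mask once, before A is modified
df$A[is_one] <- df$B[is_one]  # both sides now refer to the same rows
df$A
# [1] 100 200   2   5   4   3 100   2
```

Without the index on the right-hand side, R takes the first three elements of B (rows 1 to 3) to fill the three selected positions, which is why row 7 received the value from row 3.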

Identifying where value changes in R data.frame column

I have a data.frame in R where the value column contains data of the class character. I want to identify the row numbers where value changes. In the example below I want to get out 4, 7, and 9. Is there a way to do this without looping?
df <- data.frame(ind=1:10,
value=as.character(c(100,100,100,200,200,200,300,300,400,400)),
stringsAsFactors=F)
df
ind value
1 1 100
2 2 100
3 3 100
4 4 200
5 5 200
6 6 200
7 7 300
8 8 300
9 9 400
10 10 400
A simple solution is to use the lag function in dplyr:
which(df$value != dplyr::lag(df$value))
Similar to @thc's answer, but without a dependency:
which(c(FALSE, tail(df$value,-1) != head(df$value,-1)))
#[1] 4 7 9
You can use rle (Run Length Encoding):
cumsum(rle(df$value)$lengths)+1
[1] 4 7 9 11
You can use head to drop the last value:
head(cumsum(rle(df$value)$lengths)+1, -1)
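To see where the extra 11 comes from, it may help to look at the pieces of the rle result. A short sketch using the same values as the question; runs, ends and starts are names chosen here.

```r
value <- c("100","100","100","200","200","200","300","300","400","400")

runs <- rle(value)
runs$lengths                  # length of each run of identical values
# [1] 3 3 2 2

ends <- cumsum(runs$lengths)  # last row of each run: 3 6 8 10
starts <- head(ends, -1) + 1  # first row of every run after the first one
starts
# [1] 4 7 9
```

The last run ends at row 10, so adding 1 to it points past the end of the data; dropping it with head(..., -1) leaves only real change points.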

Delete following observations when goal has been reached

Given the dataframe:
df = data.frame(
ID = c(1,1,1,1,2,3,3),
Start = c(0,8,150,200,6,7,60),
Stop = c(5,60,170,210,NA,45,80))
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
3 1 150 170 1
4 1 200 210 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
For each ID, I would like to keep all rows until Start[i+1] - Stop[i] >= 28, and then delete the following observations of that ID
In this example, the output should be
ID Start Stop Dummy
1 1 0 5 0
2 1 8 60 1
5 2 6 NA 0
6 3 7 45 0
7 3 60 80 1
I ended up setting the NAs to a value that is easy to identify later and using the following code:
df$Stop[is.na(df$Stop)] = 10000
df$diff <- df$Start-c(0,df$Stop[-length(df$Stop)])
space <- with(df, unique(ID[diff<28]))
df2 <- subset(df, (ID %in% space & diff < 28) | !ID %in% space)
Using data.table...
library(data.table)
setDT(df)
df[,{
w = which( shift(Start,type="lead") - Stop >= 28 )
if (length(w)) .SD[seq(w[1])] else .SD
}, by=ID]
# ID Start Stop
# 1: 1 0 5
# 2: 1 8 60
# 3: 2 6 NA
# 4: 3 7 45
# 5: 3 60 80
.SD is the Subset of Data associated with each by=ID group.
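For readers without data.table, the same per-group idea can be sketched in base R with split and lapply. This is an illustrative equivalent under the question's data, not part of the original answer; keep_until_gap and its gap argument are hypothetical names.

```r
df <- data.frame(ID = c(1, 1, 1, 1, 2, 3, 3),
                 Start = c(0, 8, 150, 200, 6, 7, 60),
                 Stop = c(5, 60, 170, 210, NA, 45, 80))

# Hypothetical helper: keep rows of one ID up to the first large gap
keep_until_gap <- function(g, gap = 28) {
  # Start of the following row within this group (NA for the last row)
  lead_start <- c(g$Start[-1], NA)
  # positions where the next Start is at least `gap` after this Stop
  w <- which(lead_start - g$Stop >= gap)
  if (length(w)) g[seq_len(w[1]), ] else g
}

res <- do.call(rbind, lapply(split(df, df$ID), keep_until_gap))
rownames(res) <- NULL
res$Start
# [1]  0  8  6  7 60
```

As in the data.table version, which() silently drops the NA comparison for ID 2, so that group is kept whole.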
Create a diff column.
df$diff<-df$Start-c(0,df$Stop[-length(df$Stop)])
Subset on the basis of this column
df[df$diff<28,]
PS: I have converted 'NA' to 0. You would have to handle that anyway.
p <- which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28)
df <- df[p,]
Assuming you want to keep entries where the next entry's start is higher than the given entry's stop by 28 or more
The result is:
> p
[1] 2 3
> df[p,]
ID Start Stop
2 1 8 60
3 1 150 170
For p = 2, Start in row 3 (i + 1 = 3) is higher than Stop in row 2 (i = 2) by 90.
Or, if by until you mean the reverse condition, then
df <- df[which(df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] < 28),]
Inclusion of NA in your data frame got me thinking. You have to be very careful how you word your condition. If you want to keep all the cases where difference between next start and stop is less than 28, then the above statement will do.
However, if you want to keep all cases EXCEPT when the difference is 28 or more, then you should use
p <- which((df$Start[2:nrow(df)]-df$Stop[1:(nrow(df)-1)] >= 28))
rp <- which((!is.element(1:nrow(df),p)))
df <- df[rp,]
As it will include the unknown difference.
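The difference between the two wordings comes down to how which() treats NA. A small sketch on the gaps from the example data (Start[i + 1] - Stop[i] for consecutive rows); the vector d is written out by hand here for illustration.

```r
d <- c(3, 90, 30, -204, NA, 15)   # gaps between consecutive rows of the example

d >= 28
# [1] FALSE  TRUE  TRUE FALSE    NA FALSE

which(d >= 28)   # the NA comparison is dropped
# [1] 2 3

which(d < 28)    # dropped here as well, so the unknown gap is lost
# [1] 1 4 6

setdiff(seq_along(d), which(d >= 28))   # "everything except >= 28" keeps it
# [1] 1 4 5 6
```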

If Statements and logical operators in R

I have a dataframe with a Money column and an Age Group column.
The Money column has NAs and the Age Group column has values that range from 1 to 5.
What I want to do is find the sum of the Money column when the AgeGroup column equals a certain value. Say 5 for this example.
I have been attempting to use an if statement but I am getting the response "the condition has length > 1 and only the first element will be used".
if(df$AgeGroup == 5)
SumOfMoney <- sum(df$Money)
My problem is I don't know how to turn "if" into "when". I want to sum the Money column for those rows that have an AgeGroup value of 5, or 3, or whatever I choose.
I believe I have the condition correct, do I add a second if statement when calculating the sum?
I would use data.table for this 'by-group' operation.
library(data.table)
setDT(df)[,list(sm=sum(Money,na.rm=TRUE)),AgeGroup]
This will compute the sum of money by group. Filtering the result to get some group value :
setDT(df)[,list(sm=sum(Money,na.rm=TRUE)),AgeGroup][AgeGroup==4]
Try:
library(dplyr)
df %>%
group_by(AgeGroup) %>%
summarise(Money = sum(Money, na.rm = TRUE))
Which gives:
#Source: local data frame [5 x 2]
#
# AgeGroup Money
#1 1 1033
#2 2 793
#3 3 224
#4 4 133
#5 5 103
If you want to subset for a specific AgeGroup you could add:
... %>% filter(AgeGroup == 5)
Try:
set.seed(7)
df <- data.frame(AgeGroup = sample(1:5, 10, T), Money = sample(100:500, 10))
df[1,2] <- NA
AgeGroup Money
1 5 NA
2 2 192
3 1 408
4 1 138
5 2 280
6 4 133
7 2 321
8 5 103
9 1 487
10 3 224
with(df, tapply(Money, AgeGroup, FUN= sum, na.rm=T))
1 2 3 4 5
1033 793 224 133 103
If you would like to just have the sum of one group at a time try:
sum(df[df$AgeGroup == 5,"Money"], na.rm=T)
[1] 103
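The reason `if` does not work here is that it expects a single TRUE or FALSE, while the comparison returns one value per row. Using that per-row result as a row filter is what turns "if" into "when". A minimal sketch, with the data typed in from the table printed above; %in% is shown as one way to select several groups at once.

```r
df <- data.frame(AgeGroup = c(5, 2, 1, 1, 2, 4, 2, 5, 1, 3),
                 Money = c(NA, 192, 408, 138, 280, 133, 321, 103, 487, 224))

# One group: the comparison yields one TRUE/FALSE per row, used as a row filter
sum(df$Money[df$AgeGroup == 5], na.rm = TRUE)
# [1] 103

# Several groups at once with %in%
sum(df$Money[df$AgeGroup %in% c(3, 5)], na.rm = TRUE)
# [1] 327
```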
I think the following function should do the trick.
> AGE <- c(1,2,3,2,5,5)
> MONEY <- c(100,200,300,400,200,100)
> dat <- data.frame(cbind(AGE,MONEY))
> dat
AGE MONEY
1 1 100
2 2 200
3 3 300
4 2 400
5 5 200
6 5 100
> getSumOfGroup <- function(df, group){
+ return(sum(df[df$AGE == group,"MONEY"]))
+ }
> getSumOfGroup(dat, 5)
[1] 300
