Create a dummy if variable change [duplicate] - r

I have a data.frame in R where the value column contains data of the class character. I want to identify the row numbers where value changes. In the example below I want to get out 4, 7, and 9. Is there a way to do this without looping?
df <- data.frame(ind=1:10,
value=as.character(c(100,100,100,200,200,200,300,300,400,400)),
stringsAsFactors=FALSE)
df
ind value
1 1 100
2 2 100
3 3 100
4 4 200
5 5 200
6 6 200
7 7 300
8 8 300
9 9 400
10 10 400

A simple solution is to use the lag function in dplyr:
which(df$value != dplyr::lag(df$value))

Similar to @thc's answer, but without a dependency:
which(c(FALSE, tail(df$value,-1) != head(df$value,-1)))
#[1] 4 7 9

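You can use rle (run-length encoding). The cumulative run lengths plus one give the row at which each following run starts, including an index one past the end of the data:
cumsum(rle(df$value)$lengths)+1
[1] 4 7 9 11
Use head to drop that last value:
head(cumsum(rle(df$value)$lengths)+1, -1)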
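The answers above all compare each element with its predecessor. As a minimal base-R sketch (the helper name change_points and the column name changed are hypothetical, chosen for illustration), the comparison can be wrapped in a function, and the same logical vector gives the dummy variable the title asks for:

```r
# Example data from the question
df <- data.frame(ind = 1:10,
                 value = as.character(c(100, 100, 100, 200, 200, 200, 300, 300, 400, 400)),
                 stringsAsFactors = FALSE)

# Hypothetical helper: row numbers where x differs from the previous element
change_points <- function(x) {
  which(x[-1] != x[-length(x)]) + 1
}

change_points(df$value)
#[1] 4 7 9

# Dummy column: 1 where the value changed relative to the previous row, else 0
df$changed <- as.integer(c(FALSE, df$value[-1] != df$value[-nrow(df)]))
```

If the column contains NA, the comparison yields NA at those positions; which() skips them, while the dummy column would carry them through as NA.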

Related

Reuse value of previous row during dplyr::mutate

I am trying to group events based on their time of occurrence. To do this, I calculate a diff over the timestamps and want to start a new group whenever the diff exceeds a certain value. I tried the code below, but it does not work because the dialog variable is not yet available inside the mutate call that creates it.
library(tidyverse)
df <- data.frame(time = c(1,2,3,4,5,510,511,512,513), id = c(1,2,3,4,5,6,7,8,9))
> df
time id
1 1 1
2 2 2
3 3 3
4 4 4
5 5 5
6 510 6
7 511 7
8 512 8
9 513 9
df <- df %>%
mutate(t_diff = c(NA, diff(time))) %>%
# This generates an error as dialog is not available as a variable at this point
mutate(dialog = ifelse(is.na(t_diff), id, ifelse(t_diff >= 500, id, lag(dialog, 1))))
# This is the desired result
> df
time id t_diff dialog
1 1 1 NA 1
2 2 2 1 1
3 3 3 1 1
4 4 4 1 1
5 5 5 1 1
6 510 6 505 6
7 511 7 1 6
8 512 8 1 6
9 513 9 1 6
In words, I want to add a column that points to the first element of each group, where a new group starts wherever the diff to the previous element is 500 or more.
Unfortunately, I have not found a workaround that does this efficiently with dplyr. Iterating over the data.frame with a loop would work, but would be very inefficient.
Is there a way to achieve this in dplyr?
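One possible way around the self-reference is to avoid it altogether: mark the rows that start a group, then number the groups with cumsum. The following is a minimal base-R sketch under that assumption; the names starts and group are illustrative and not part of the question. The same expressions should also be usable inside mutate, although that is not shown here.

```r
df <- data.frame(time = c(1, 2, 3, 4, 5, 510, 511, 512, 513),
                 id = c(1, 2, 3, 4, 5, 6, 7, 8, 9))

# Gap to the previous row; the first row has no predecessor
df$t_diff <- c(NA, diff(df$time))

# A new group starts at the first row and wherever the gap is at least 500
starts <- is.na(df$t_diff) | df$t_diff >= 500
group <- cumsum(starts)

# Each row gets the id of the first row in its group
df$dialog <- df$id[starts][group]
df$dialog
#[1] 1 1 1 1 1 6 6 6 6
```

Because cumsum increases by one at every group start, group doubles as an index into the ids of the starting rows, so no row ever needs the previous row's dialog value.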

Summing a column to a certain value

I have a data.frame with 2 variables and 177 observations. I would like to sum up one variable to a certain value, and then get the value of the other variable when that threshold is reached. I will try to add a reproducible example. I am new here, so forgive me if I do it wrong.
> df <- data.frame(x=10:1,y=1:10)
> print(df)
x y
1 10 1
2 9 2
3 8 3
4 7 4
5 6 5
6 5 6
7 4 7
8 3 8
9 2 9
10 1 10
How can I sum column y until it reaches a certain value, say 7, and then have it return either the value of x (4) or the row number (7)? I am sure it is pretty straightforward, but I seem to be drawing a blank.
Here is my solution.
df[cumsum(df$y) <= 7,]
x y
1 10 1
2 9 2
3 8 3
The OP just asked for the relevant value of x which would be done using:
df$x[which(cumsum(df$y) >= 10)[1]]
Also note that this finds the first row where cumsum(df$y) is at least 10, whereas the other answers find the last rows where it is <= 7, which is potentially different (though not for this dataset). For the original question (pre-comment) it would need to be:
df$x[which(cumsum(df$y) > 7)[1]]
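To make the running-total reading concrete, here is a small sketch using the example data. The helper name first_reach is hypothetical, and using >= rather than > is an assumption about what "reaches" means. Under this reading the threshold of 7 is first met in row 4; the asker's expected answer (x = 4, row 7) instead matches the reading where y itself reaches 7.

```r
df <- data.frame(x = 10:1, y = 1:10)

running <- cumsum(df$y)          # 1 3 6 10 15 21 28 36 45 55

# Hypothetical helper: first row at which the running total reaches `threshold`
first_reach <- function(values, threshold) {
  which(cumsum(values) >= threshold)[1]
}

row <- first_reach(df$y, 7)      # row 4, where the running total is 10
df$x[row]                        # x in that row: 7
```

If the threshold is never reached, which() returns an empty vector, so the helper yields NA rather than an error.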
If you want to stay with base R, try this
> df$x[df$y >= 7][1]
[1] 4
> max(cumsum(df$y[df$y <= 7]))
[1] 28
Or if you need this in a matrix form:
> cbind(df$x[df$y >= 7][1], max(cumsum(df$y[df$y <= 7])))
[,1] [,2]
[1,] 4 28
I would still look into switching to data.table or at least dplyr packages for data manipulation.

recursive replacement in R

I am trying to clean some data and would like to replace zeros with values from the previous date. I was hoping the following code would work, but it doesn't:
temp = c(1,2,4,5,0,0,6,7)
temp[which(temp==0)]=temp[which(temp==0)-1]
It returns
1 2 4 5 5 0 6 7
instead of
1 2 4 5 5 5 6 7
which is what I was hoping for.
Is there a nice way of doing this without looping?
The operation is called "Last Observation Carried Forward" and is usually used to fill data gaps. It is a common operation for time series and is therefore implemented in the zoo package:
temp = c(1,2,4,5,0,0,6,7)
temp[temp==0] <- NA
library(zoo)
na.locf(temp)
#[1] 1 2 4 5 5 5 6 7
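The asker's one-liner fails because the right-hand side is evaluated in full before anything is assigned, so the second zero receives the old value at position 5, which is still 0. For cases where adding zoo is not wanted, the carry-forward can be sketched in base R as follows. The helper name locf_zero is hypothetical, and the sketch assumes the input contains no NA values.

```r
# Hypothetical helper: replace zeros with the last non-zero value before them.
# Assumes x contains no NA; leading zeros have nothing to carry forward and become NA.
locf_zero <- function(x) {
  # Index of the most recent non-zero element at each position (0 if none yet)
  idx <- cummax(seq_along(x) * (x != 0))
  # Prepend NA so that index 0 maps to NA
  c(NA, x)[idx + 1]
}

locf_zero(c(1, 2, 4, 5, 0, 0, 6, 7))
#[1] 1 2 4 5 5 5 6 7
```

Multiplying the positions by the non-zero mask zeroes out the positions of the gaps, and cummax then fills each gap with the position of the last real observation.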
You could use essentially the same logic, except that you apply it to the values vector returned by rle:
temp = c(1,2,4,5,0,0,6,0)
o <- rle(temp)
o$values[o$values == 0] <- o$values[which(o$values == 0) - 1]
inverse.rle(o)
#[1] 1 2 4 5 5 5 6 6

Excel OFFSET function in r

I am trying to simulate the OFFSET function from Excel. I understand that this can be done for a single value but I would like to return a range. I'd like to return a group of values with an offset of 1 and a group size of 2. For example, on row 4, I would like to have a group with values of column a, rows 3 & 2. Sorry but I am stumped.
Is it possible to add this result to the data frame as another column using cbind or similar? Alternatively, could I use this in a vectorized function so I could sum or mean the result?
Mockup Example:
> df <- data.frame(a=1:10)
> df
a
1 1
2 2
3 3
4 4
5 5
6 6
7 7
8 8
9 9
10 10
> #PROCESS
> df
a b
1 1 NA
2 2 (1)
3 3 (1,2)
4 4 (2,3)
5 5 (3,4)
6 6 (4,5)
7 7 (5,6)
8 8 (6,7)
9 9 (7,8)
10 10 (8,9)
This should do the trick:
df$b1 <- c(rep(NA, 1), head(df$a, -1))
df$b2 <- c(rep(NA, 2), head(df$a, -2))
Note that the result has to live in two columns, because an ordinary data frame column holds a single atomic value per row. (Unless you want to resort to complex numbers.) head with a negative argument drops that many elements from the tail; try head(1:10, -2). rep is repetition and c is concatenation. The <- assignment adds a new column if it does not exist yet.
What Excel calls OFFSET is sometimes also referred to as lag.
EDIT: Following Greg Snow's comment, here's a version that's more elegant, but also more difficult to understand:
df <- cbind(df, as.data.frame((embed(c(NA, NA, df$a), 3))[,c(3,2)]))
Try it component by component to see how it works.
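Taken component by component, the embed call can be read as in the sketch below, which also covers the question's follow-up about summing or averaging the offset range. The names windows, lagged, prev_sum and prev_mean are illustrative only.

```r
a <- 1:10

# Pad with two NAs so every row has a full window of three values
windows <- embed(c(NA, NA, a), 3)   # 10 x 3 matrix: current value, lag 1, lag 2

# Keep only the two previous values, oldest first
lagged <- windows[, c(3, 2)]

# Sum and mean of the two previous values; rows with an incomplete window give NA
prev_sum  <- rowSums(lagged)
prev_mean <- rowMeans(lagged)

df <- data.frame(a = a, prev_sum = prev_sum, prev_mean = prev_mean)
```

Keeping the lags as matrix columns rather than as tuples in one column is what lets rowSums and rowMeans operate on them directly.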
Do you want something like this?
> df <- data.frame(a=1:10)
> b=t(sapply(1:10, function(i) c(df$a[(i+2)%%10+1], df$a[(i+4)%%10+1])))
> s = sapply(1:10, function(i) sum(b[i,]))
> df = data.frame(df, b, s)
> df
a X1 X2 s
1 1 4 6 10
2 2 5 7 12
3 3 6 8 14
4 4 7 9 16
5 5 8 10 18
6 6 9 1 10
7 7 10 2 12
8 8 1 3 4
9 9 2 4 6
10 10 3 5 8
