This question already has answers here:
Create group number for contiguous runs of equal values
(4 answers)
Closed 7 years ago.
I have a data_frame where a character variable x changes in time. I want to count the number of times it changes, and fill a new vector with this count.
df <- data_frame(
x = c("a", "a", "b", "b", "c", "b"),
wanted = c(1, 1, 2, 2, 3, 4)
)
x wanted
1 a 1
2 a 1
3 b 2
4 b 2
5 c 3
6 b 4
This is similar to, but different from rle(df$x), which would return
Run Length Encoding
lengths: int [1:4] 2 2 1 1
values : chr [1:4] "a" "b" "c" "b"
I could try to rep() that output. I have also tried this, which is awfully close, but not for reasons I can't figure out immediately:
df %>% mutate(
try_1 = cumsum(ifelse(x == lead(x) | is.na(lead(x)), 1, 0))
)
Source: local data frame [6 x 3]
x wanted try_1
1 a 1 1
2 a 1 1
3 b 2 2
4 b 2 2
5 c 3 2
6 b 4 3
It seems like there should be a function that does this directly, that I just haven't found in my experience.
Try this dplyr code:
df %>%
mutate(try_1 = cumsum(ifelse(x != lag(x) | is.na(lag(x)), 1, 0)))
x wanted try_1
1 a 1 1
2 a 1 1
3 b 2 2
4 b 2 2
5 c 3 3
6 b 4 4
Yours was saying: increment the count if a value is the same as the following row's value, or if the following row's value is NA.
This says: increment the count if the variable on this row either is different than the one on the previous row, or if there wasn't one on the previous row (e.g., row 1).
You can try
library(data.table) #data.table_1.9.5
setDT(df)[, wanted := rleid(x)][]
# x wanted
#1: a 1
#2: a 1
#3: b 2
#4: b 2
#5: c 3
#6: b 4
Or a base R option would be
inverse.rle(within.list(rle(as.character(df$x)),
values<- seq_along(values)))
#[1] 1 1 2 2 3 4
data
df <- data.frame(x=c("a", "a", "b", "b", "c", "b"))
Related
Suppose you have a data frame of columns "a" and "b" with the values shown below, generated with df <- data.frame(a=c(0, 1, 2, 2, 3), b=c(1, 3, 8, 9, 4)). Suppose you want to add a column "c", whereby if a value in "a" equals the value in the immediately preceding row in col "a", then the corresponding row values in col "b" are summed; otherwise a 0 value is shown. A column "c" is added to the below to illustrate what I'm trying to do:
a b add col c
1 0 1 0
2 1 3 0
3 2 8 0
4 2 9 17 (since the values in col "a" rows 3 and 4 are equal, add the values in col b rows 3 and 4)
5 3 4 0
Or in this scenario, whereby cols "a" and "b" are generated by df <- data.frame(a=c(0,1,2,2,2,3), b=c(1,2,3,4,5,6)):
a b add col c
1 0 1 0
2 1 2 0
3 2 3 0
4 2 4 7 (3+4 from col "b")
5 2 5 9 (4+5 from col "b")
6 3 6 0 (since 2 from prior row <> 3 from current row)
What is the easiest way to do this in native R?
As we are interested in the adjacent values to be equal, use rleid (from data.table) to create a grouping index, then create the 'c', by adding the 'b' with lag of 'b' and replace the default first value of lag (NA) to 0
library(dplyr)
library(data.table)
library(tidyr)
df %>%
group_by(grp = rleid(a)) %>%
mutate(c = replace_na(b + lag(b), 0)) %>%
ungroup %>%
select(-grp)
-output
# A tibble: 6 × 3
a b c
<dbl> <dbl> <dbl>
1 0 1 0
2 1 2 0
3 2 3 0
4 2 4 7
5 2 5 9
6 3 6 0
Or using base R - a similar approach is with the rle to create the 'grp', then use ave to do the addition of previous with current value (by removing the first and last) and then append 0 at the beginning
grp <- with(rle(df$a), rep(seq_along(values), lengths))
df$c <- with(df, ave(b, grp, FUN = function(x) c(0, x[-1] + x[-length(x)])))
I have a dataframe containing multiple groups that are not explicitly stated. Instead, new group always start when type == 1, and is the same for following rows, containing type == 2. The number of rows per group can vary.
How can I explicitly create new variable based on order of another column? The groups, of course, should be exclusive.
My data:
df <- data.frame(type = c(1,2,2,1,2,1,2,2,2,1),
stand = 1:10)
Expected output with new group myGroup:
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d
One option could be:
with(df, letters[cumsum(type == 1)])
[1] "a" "a" "a" "b" "b" "c" "c" "c" "c" "d"
Here is another option using rep() + diff(), but not as simple as the approach by #tmfmnk
idx <- which(df$type==1)
v <- diff(which(df$type==1))
df$myGroup <- rep(letters[seq(idx)],c(v <- diff(which(df$type==1)),nrow(df)-sum(v)))
such that
> df
type stand myGroup
1 1 1 a
2 2 2 a
3 2 3 a
4 1 4 b
5 2 5 b
6 1 6 c
7 2 7 c
8 2 8 c
9 2 9 c
10 1 10 d
This question already has answers here:
Repeat each row of data.frame the number of times specified in a column
(10 answers)
Closed 3 years ago.
How is it possible to transform this data frame so that the count is divided into separate observations?
df = data.frame(object = c("A","B", "A", "C"), count=c(1,2,3,2))
object count
1 A 1
2 B 2
3 A 3
4 C 2
So that the resulting data frame looks like this?
object observation
1 A 1
2 B 1
3 B 1
4 A 1
5 A 1
6 A 1
7 C 1
8 C 1
rep(df$object, df$count)
If you want the 2 columns:
df2 = data.frame(object = rep(df$object, df$count))
df2$count = 1
If you're working with tidyverse - otherwise that's overkill -, you could also do:
library(tidyverse)
uncount(df, count) %>% mutate(observation = 1)
Using data.table:
library(data.table)
setDF(df)[rep(seq_along(count), count), .(object, count = 1L)]
object count
1: A 1
2: B 1
3: B 1
4: A 1
5: A 1
6: A 1
7: C 1
8: C 1
I have a dataframe in the following format
> x <- data.frame("a" = c(1,1),"b" = c(2,2),"c" = c(3,4))
> x
a b c
1 1 2 3
2 1 2 4
I'd like to add 3 new columns which is a cumulative product of the columns a b c, however I need a reverse cumulative product i.e. the output should be
row 1:
result_d = 1*2*3 = 6 , result_e = 2*3 = 6, result_f = 3
and similarly for row 2
The end result will be
a b c result_d result_e result_f
1 1 2 3 6 6 3
2 1 2 4 8 8 4
the column names do not matter this is just an example. Does anyone have any idea how to do this?
as per my comment, is it possible to do this on a subset of columns? e.g. only for columns b and c to return:
a b c results_e results_f
1 1 2 3 6 3
2 1 2 4 8 4
so that column "a" is effectively ignored?
One option is to loop through the rows and apply cumprod over the reverse of elements and then do the reverse
nm1 <- paste0("result_", c("d", "e", "f"))
x[nm1] <- t(apply(x, 1,
function(x) rev(cumprod(rev(x)))))
x
# a b c result_d result_e result_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
Or a vectorized option is rowCumprods
library(matrixStats)
x[nm1] <- rowCumprods(as.matrix(x[ncol(x):1]))[,ncol(x):1]
temp = data.frame(Reduce("*", x[NCOL(x):1], accumulate = TRUE))
setNames(cbind(x, temp[NCOL(temp):1]),
c(names(x), c("res_d", "res_e", "res_f")))
# a b c res_d res_e res_f
#1 1 2 3 6 6 3
#2 1 2 4 8 8 4
I would like to break a dataset into two frames - one for which the original dataset has duplicate observations based on a condition and one for which the original dataset does not have duplicate observations based on a condition. In the following example, I would like to break the frame into one for which there is only one coder for an observation and one for which there are two coders::
frame <- data.frame(id = c(1,1,1,2,2,3), coder = c("A", "A", "B", "A", "B", "A"), y = c(4,5,4,1,1,2))
frame
For this, I would like to produce, such that:
frame1:
id coder y
1 1 A 4
2 1 A 5
3 1 B 4
4 2 A 1
5 2 B 1
frame2:
6 3 A 2
You can use aggregate to determine the ids you want in each data frame:
cts <- aggregate(coder~id, frame, function(x) length(unique(x)))
cts
# id coder
# 1 1 2
# 2 2 2
# 3 3 1
Then you can subset as appropriate based on this:
subset(frame, id %in% cts$id[cts$coder >= 2])
# id coder y
# 1 1 A 4
# 2 1 A 5
# 3 1 B 4
# 4 2 A 1
# 5 2 B 1
subset(frame, id %in% cts$id[cts$coder < 2])
# id coder y
# 6 3 A 2
You may also try:
indx <- !colSums(!table(frame$coder, frame$id))
frame[frame$id %in% names(indx)[indx],]
# id coder y
#1 1 A 4
#2 1 A 5
#3 1 B 4
#4 2 A 1
#5 2 B 1
frame[frame$id %in% names(indx)[!indx],]
# id coder y
#6 3 A 2
Explanation
table(frame$coder, frame$id)
# 1 2 3
# A 2 1 1
# B 1 1 0 #Here for id 3, B==0
If we Negate that, the result would be a logical index
!table(frame$coder, frame$id).
Do the colSums of the above, which results
# 1 2 3
# 0 0 1
Negate again and get the index for ids and subset those ids which are TRUE
From this you can subset by matching with the names of the ids