I have the following dataframe:
a a a b c c d e a a b b b e e d d
The required result should be
a b c d e a b e d
It means no two consecutive rows should have the same value. How can this be done without using a loop?
As my data set is quite huge, looping takes a lot of time to execute.
The dataframe structure is like the following:
a 1
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10
Result:
a 1
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4
It should delete the entire row.
One easy way is to use rle:
Here's your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items
rle returns a list with two values: the run length ("lengths"), and the value that is repeated for that run ("values").
rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Update: For a data.frame
If you are working with a data.frame, try something like the following:
## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)
## Use rle, as before
X <- rle(mydf$V1)
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
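To see why this works: each run starts immediately after the previous one ends, so dropping the last run length, prepending 1, and taking the cumulative sum yields the start index of every run.
X$lengths
# [1] 3 1 2 1 1 2 2 2 2
cumsum(c(1, head(X$lengths, -1)))  # same as the expression above, written with head()
# [1]  1  4  5  7  8  9 11 13 15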
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4
Update 2
The "data.table" package has a function rleid that lets you do this quite easily. Using mydf from above, try:
library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4
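Note that V1 is dropped from the output because columns used in the by expression are excluded from .SD. If you want to keep V1, one option (my sketch, not part of the original answer) is to select the first row of each run directly with rowid():
as.data.table(mydf)[rowid(rleid(V1)) == 1L]
#    V1  V2
# 1:  a   1
# 2:  b   2
# 3:  c   4
# 4:  d   3
# 5:  e   9
# 6:  a   4
# 7:  b  10
# 8:  e   2
# 9:  d   4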
library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x != lag(x, default = "1")]  # default must be character to match the type of x
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
EDIT: For a data.frame
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10),
stringsAsFactors=FALSE)
The dplyr solution is a one-liner:
mydf %>% filter(V1 != lag(V1, default = "1"))
# V1 V2
#1 a 1
#2 b 2
#3 c 4
#4 d 3
#5 e 9
#6 a 4
#7 b 10
#8 e 2
#9 d 4
post scriptum
lead(x, 1), suggested by @Carl Witthoft, compares in the other direction, so it keeps the last row of each run instead of the first:
leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]
# V1 V2
#3 a 3
#4 b 2
#6 c 1
#7 d 3
#8 e 9
#10 a 8
#12 b 199
#14 e 5
#16 d 10
With base R, I like funny algorithmics:
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Much as I like... errr, love rle, here's a shootoff:
EDIT: Can't figure out exactly what's up with dplyr, so I used dplyr::lead. I'm on OSX, R 3.1.2, and the latest dplyr from CRAN.
library(dplyr)            # for lead()
library(microbenchmark)
xlet <- sample(letters, 1e5, replace = TRUE)
rleit <- function(x) rle(x)$values
lagit <- function(x) x[x != lead(x, default = "1")]
tailit <- function(x) x[x != c(tail(x, -1), tail(x, 1))]  # pads with the last element, so the final run is dropped
microbenchmark(rleit(xlet),lagit(xlet),tailit(xlet),times=20)
Unit: milliseconds
expr min lq median uq max neval
rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657 20
lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940 20
tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840 20
Tidyverse solution:
library(dplyr)  # consecutive_id() requires dplyr >= 1.1.0; tibble() is re-exported by dplyr
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x)
x |>
mutate(id = consecutive_id(x)) |>
distinct(x, id)
In addition, if there is another column y associated with the consecutive values column, this solution allows some flexibility:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
x <- tibble(x, y = runif(length(x)))
x |>
group_by(id = consecutive_id(x)) |>
slice_min(y)
We can choose between the different slice functions, like slice_max, slice_min, slice_head, and slice_tail.
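For example, keeping the first row of each run (matching the expected output above):
x |>
  group_by(id = consecutive_id(x)) |>
  slice_head(n = 1) |>
  ungroup()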
This Stack Overflow thread appeared in the second edition of R4DS, in the Numbers chapter of the book.
Related
I have a data frame, which looks like this:
DF_A <- data.frame(
Group_1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C"),
Group_2 = c("A", "B", "C", "A", "B", "A", "B", "A", "C", "A")
)
I would like to assign a consecutive number for Group_1 IDs which should be unique for the case of identical Group_2 IDs. For example, A+A starts with 1, A+B proceeds with 2 (same Group_1 ID, but new Group_2 ID), ..., A+A is again 1 (obviously a repetition). B+A is 1 (new Group_1 ID), ..., B+A (same Group_1 ID, but new Group_2 ID)...and so forth.
The result should look like this.
DF_B <- data.frame(
Group_1 = c("A", "A", "A", "A", "A", "B", "B", "B", "B", "C"),
Group_2 = c("A", "B", "C", "A", "B", "A", "B", "A", "C", "A"),
ID = c(1, 2, 3, 1, 2, 1, 2, 1, 1, 1)
)
I investigated various posts on related approaches, such as numbering single groups, groups within groups, or a combination, but without any success; this case is not covered by previous posts.
Thank you in advance.
One way to do it with ave is
DF_A$ID <- ave(DF_A$Group_2, DF_A$Group_1, FUN = function(x) match(x, unique(x)))
DF_A
# Group_1 Group_2 ID
#1 A A 1
#2 A B 2
#3 A C 3
#4 A A 1
#5 A B 2
#6 B A 1
#7 B B 2
#8 B A 1
#9 B C 3
#10 C A 1
The equivalent dplyr way is:
library(dplyr)
DF_A %>%
group_by(Group_1) %>%
mutate(ID = match(Group_2, unique(Group_2)))
You can split into groups by Group_1, then create a factor from the Group_2 values within each group, and convert it to integer:
DF_A$ID <- unlist(by(DF_A, DF_A$Group_1, function(x) as.integer(factor(x$Group_2))))
We can use dense_rank from dplyr.
library(dplyr)
DF_A2 <- DF_A %>%
group_by(Group_1) %>%
mutate(ID = dense_rank(Group_2)) %>%
ungroup()
DF_A2
# # A tibble: 10 x 3
# Group_1 Group_2 ID
# <fct> <fct> <int>
# 1 A A 1
# 2 A B 2
# 3 A C 3
# 4 A A 1
# 5 A B 2
# 6 B A 1
# 7 B B 2
# 8 B A 1
# 9 B C 3
# 10 C A 1
You could use the integer values of the factor levels. We can simply wrap Group_2 in c() to drop the factor attribute.
within(DF_A, { ID = ave(c(Group_2), Group_1, FUN = c) })
# Group_1 Group_2 ID
# 1 A A 1
# 2 A B 2
# 3 A C 3
# 4 A A 1
# 5 A B 2
# 6 B A 1
# 7 B B 2
# 8 B A 1
# 9 B C 3
# 10 C A 1
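Caveat for current R: data.frame() defaults to character columns since R 4.0, and c() no longer strips the factor class since R 4.1, so the line above no longer works as written. What remains of the idea is the global factor codes, which match the expected ID here only because the global level order happens to coincide with each group's first-appearance order:
DF_A$ID <- as.integer(factor(DF_A$Group_2))
DF_A$ID
# [1] 1 2 3 1 2 1 2 1 3 1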
I have a dataset which is of the following form:
a <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
X2=c("B", "C", "C", "A", "A", "B"),
X3=c("B", "E", "A", "A", "A", "B"),
X4=c("E", "C", "A", "A", "A", "C"),
X5=c("A", "C", "C", "A", "B", "B")
)
And I have another set of the following form:
b <- data.frame(col_1=c("ASD", "ASD", "BSD", "BSD"),
col_2=c(1, 1, 1, 1),
col_3=c(12, 12, 31, 21),
col_4=c("A", "B", "B", "A")
)
What I want to do is take column col_4 from set b and match it row-wise against set a, so that a new column tells me how many elements of col_4 each row contains. The name of the new column does not matter.
For example, the first and fifth rows in set a contain all the elements of col_4 from set b.
Also, duplicates shouldn't be over-counted. For example, the sixth row in set a has three "B"s, but since col_4 from set b has only two "B"s, it should tell me 2 and not 3.
The expected output is of the form:
c <- data.frame(X1=c("A", "B", "C", "A", "B", "C"),
X2=c("B", "C", "C", "A", "A", "B"),
X3=c("B", "E", "A", "A", "A", "B"),
X4=c("E", "C", "A", "A", "A", "C"),
X5=c("A", "C", "C", "A", "B", "B"),
found=c(4, 1, 2, 2, 4, 2)
)
We can use vecsets::vintersect which takes care of duplicates.
Using apply row-wise, we can count how many common values there are between b$col_4 and each row in a.
apply(a, 1, function(x) length(vecsets::vintersect(b$col_4, x)))
#[1] 4 1 2 2 4 2
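The point of vintersect over base intersect is that it respects multiplicity; a quick illustration (mine, not part of the original answer):
library(vecsets)
intersect(c("B", "B", "B"), c("B", "B"))   # [1] "B"      -- base intersect drops duplicates
vintersect(c("B", "B", "B"), c("B", "B"))  # [1] "B" "B"  -- keeps the smaller multiplicity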
An option using data.table:
library(data.table)
#convert a into a long format
m <- melt(setDT(a)[, rn:=.I], id.vars="rn", value.name="col_4")
#order by row number and create an index for identical occurrences in col_4
setorder(m, rn, col_4)[, vidx := rowid(col_4), rn]
#create a similar index for b
setDT(b, key="col_4")[, vidx := rowid(col_4)]
#count occurrences and lookup this count into original data
a[b[m, on=.(col_4, vidx), nomatch=0L][, .N, rn], on=.(rn), found := N]
output:
X1 X2 X3 X4 X5 rn found
1: A B B E A 1 4
2: B C E C C 2 1
3: C C A A C 3 2
4: A A A A A 4 2
5: B A A A B 5 4
6: C B B C B 6 2
Another idea for operating on the sets efficiently is to count and compare the occurrences of the elements of b$col_4 in each row of a:
b1 = c(table(b$col_4))
#b1
#A B
#2 2
a1 = table(factor(as.matrix(a), names(b1)), row(a))
#a1
#
# 1 2 3 4 5 6
# A 2 0 2 5 3 0
# B 2 1 0 0 2 3
Finally, take the smaller of the two occurrence counts per element (for each row) and sum:
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2
For a "data.frame" of larger dimensions and with more elements, Matrix::sparseMatrix offers an appropriate alternative:
library(Matrix)
a.fac = factor(as.matrix(a), names(b1))
.i = as.integer(a.fac)
.j = c(row(a))
noNA = !is.na(.i) ## need to remove NAs manually
.i = .i[noNA]
.j = .j[noNA]
a1 = sparseMatrix(i = .i, j = .j, x = 1L, dimnames = list(names(b1), 1:nrow(a)))
a1
#2 x 6 sparse Matrix of class "dgCMatrix"
# 1 2 3 4 5 6
#A 2 . 2 5 3 .
#B 2 1 . . 2 3
colSums(pmin(a1, b1))
#1 2 3 4 5 6
#4 1 2 2 4 2
I have a data.table with 4+ columns. The first 3 columns together identify one unique individual.
c1 c2 c3 c4
a c e other_data
a c e other_data
a c f other_data
a c f other_data
a d f other_data
b d g other_data
# (c1 = "a" AND c2 = "c" AND c3 = "e") => one individual
# (c1 = "a" AND c2 = "c" AND c3 = "f") => another individual
I'd like to compute another column which would mark each individual :
c1 c2 c3 c4 unique_individual_id
a c e other_data 1
a c e other_data 1
a c f other_data 2
a c f other_data 2
a d f other_data 3
b d g other_data 4
I would like to get a unique hash out of the content of the 3 columns.
How would I do that in code?
as.numeric(as.factor(with(df, paste(c1, c2, c3))))
#[1] 1 1 2 2 3 4
We can use interaction to create the unique index
df1$unique_individual_id <- as.integer(do.call(interaction, c(df1[-4], drop = TRUE)))
df1$unique_individual_id
#[1] 1 1 2 2 3 4
Alternatively, you can paste together the values of interest (for each row, the values in columns 1, 2, and 3), convert to factor, and then to integer; this returns a unique ID number for each combination:
df <- data.frame(c("a", "a", "b", "c", "c", "d", "d"),
c("a", "a", "b", "c", "d", "e", "e"),
c("c", "c", "d", "d", "e", "e", "e"))
df$ID <- as.numeric(as.factor(sapply(1:nrow(df), (function(i) {paste(df[i, 1:3], collapse = "")}))))
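Since the data is a data.table, a grouping sketch with .GRP also works; .GRP numbers the groups in order of first appearance (assuming the table is named dt with columns c1, c2, c3, as in the question):
library(data.table)
dt <- data.table(c1 = c("a", "a", "a", "a", "a", "b"),
                 c2 = c("c", "c", "c", "c", "d", "d"),
                 c3 = c("e", "e", "f", "f", "f", "g"))
dt[, unique_individual_id := .GRP, by = .(c1, c2, c3)]
dt$unique_individual_id
# [1] 1 1 2 2 3 4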
I have the following variable columns:
var1 <- c("a", "b", "a", "a", "c", "a", "b", "b", "c", "b", "c", "c", "d")
var2 <- c("a", "a", "b", "c", "a", "d", "b", "c", "b", "d", "c", "d", "d")
mydf <- data.frame(var1, var2)
I want to find the unique variable combinations, such that
(a) var1 a / var2 b and var1 b / var2 a are not considered unique, and
(b) no identical combinations are present,
for example var1 a and var2 a, or var1 b and var2 b.
I used the following code, but it is not providing what I am expecting:
unique(mydf)
var1 var2
1 a a
2 b a
3 a b
4 a c
5 c a
6 a d
7 b b
8 b c
9 c b
10 b d
11 c c
12 c d
13 d d
My expected output is:
var1 var2
1 a b
2 a c
3 a d
4 b c
5 b d
6 c d
Thanks.
This should do it:
mydf = mydf[mydf[,1] != mydf[,2], ]
mydf = mydf[!duplicated(data.frame(t(apply(mydf, 1, sort)))), ]
> mydf
var1 var2
2 b a
4 a c
6 a d
8 b c
10 b d
12 c d
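A vectorized variant of the same idea uses pmin()/pmax() to put each pair in sorted order before dropping duplicates (my sketch, starting from the original mydf and assuming the columns are character rather than factors):
keep <- mydf$var1 != mydf$var2
unique(data.frame(var1 = pmin(mydf$var1[keep], mydf$var2[keep]),
                  var2 = pmax(mydf$var1[keep], mydf$var2[keep])))
#   var1 var2
# 1    a    b
# 3    a    c
# 5    a    d
# 6    b    c
# 8    b    d
# 9    c    d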
More of an exercise to teach myself some sets package behavior:
require(sets)
mydf <- data.frame(var1, var2, stringsAsFactors=FALSE) # unneeded factors are a plague on R/S
dlis <- list()
for (i in seq(nrow(mydf))) {
  if (length(set(mydf[i, 1], mydf[i, 2])) == 2) {
    dlis <- c(dlis, list(set(mydf[i, 1], mydf[i, 2])))
  }
}
unique(dlis)
[[1]]
{"a", "b"}
[[2]]
{"a", "c"}
[[3]]
{"a", "d"}
[[4]]
{"b", "c"}
[[5]]
{"b", "d"}
[[6]]
{"c", "d"}