Delete rows after a certain sequence of values in a certain column

a <- c("A","A","A","B","B","B","C","C","C","C","D","D","D","D","D")
b <- c("x","y","z","x","x","z","y","z","z","z","y","z","z","z","x")
df = data.frame(a,b)
a b
1 A x
2 A y
3 A z
4 B x
5 B x
6 B z
7 C y
8 C z
9 C z
10 C z
11 D y
12 D z
13 D z
14 D z
15 D x
For every group A, B, C, D, I'd like to delete the z values in column b whenever the group ends with the sequence y, z.
If we have the case of a=="C", where the b-values are y,z,z,z, I'd like to delete all the z's. However, in a=="D", nothing has to change, as x is the last value.
The result looks like this:
a b
1 A x
2 A y
4 B x
5 B x
6 B z
7 C y
11 D y
12 D z
13 D z
14 D z
15 D x
By grouping in dplyr, I can identify the last occurrence of each value in A, so the basic case shown in a=="A" is not a problem. I have trouble finding a solution for the case of a=="C", where I could have one occurrence of y followed by 20 occurrences of z.

You can use by and cummin in base R:
df[unlist(by(df$b, interaction(df$a), FUN = function(x) {
tmp <- rev(cummin(rev(x == "z")))
if (tail(x[!tmp], 1) == "y") !tmp else rep(TRUE, length(x))
})), ]
The result:
a b
1 A x
2 A y
4 B x
5 B x
6 B z
7 C y
11 D y
12 D z
13 D z
14 D z
15 D x
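The same idea can also be written with rle, which makes the "trailing run of z preceded by y" test explicit. This is only a sketch: keep_rows is a hypothetical helper name, and ave is used here so the keep/drop flags come back in the original row order even if the groups are not sorted.

```r
# Hypothetical helper: for one group's b values, return TRUE for rows to keep
keep_rows <- function(b) {
  b <- as.character(b)
  r <- rle(b)
  n <- length(r$values)
  keep <- rep(TRUE, length(b))
  # Drop the trailing run only if it is all "z" and the run before it is "y"
  if (n >= 2 && r$values[n] == "z" && r$values[n - 1] == "y") {
    keep[(length(b) - r$lengths[n] + 1):length(b)] <- FALSE
  }
  keep
}

a <- c("A","A","A","B","B","B","C","C","C","C","D","D","D","D","D")
b <- c("x","y","z","x","x","z","y","z","z","z","y","z","z","z","x")
df <- data.frame(a, b)

# ave() applies keep_rows per group and returns flags in row order
keep <- as.logical(ave(seq_along(df$b), df$a,
                       FUN = function(i) keep_rows(df$b[i])))
res <- df[keep, ]
res
```

A group consisting only of z values is left untouched here, since there is no preceding y.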

Here's a possible data.table solution. Basically, I'm creating a logical index that satisfies three conditions at once: the value is a z, the first z comes right after a y (or is the first value in the group), and the last value in the group is a z. Then I'm just keeping the rows where the index is FALSE.
library(data.table)
setDT(df)[, indx := b == "z" &
max(which(b == "z")) == .N &
ifelse(min(which(b == "z")) == 1L,
TRUE,
b[min(which(b == "z")) - 1L] == "y"),
by = a][!(indx)]
# a b indx
# 1: A x FALSE
# 2: A y FALSE
# 3: B x FALSE
# 4: B x FALSE
# 5: B z FALSE
# 6: C y FALSE
# 7: D y FALSE
# 8: D z FALSE
# 9: D z FALSE
# 10: D z FALSE
# 11: D x FALSE

Here's a base solution:
do.call("rbind", by(df, df$a, FUN = function(x) {
if(x$b[length(x$b)] == "z") {
y <- which(x$b == "y")
if(!length(y)) {
return(x)
}
z <- which(x$b == "z")
if(!length(z)) {
return(x)
}
# check if y isn't immediately before z
if(max(y) - min(z) > 1) {
return(x)
} else {
return(x[-z,])
}
} else {
return(x)
}
}))
And the result:
a b
A.1 A x
A.2 A y
B.4 B x
B.5 B x
B.6 B z
C C y
D.11 D y
D.12 D z
D.13 D z
D.14 D z
D.15 D x

Not so efficient, but works fine:
library(stringr)
df2 <- data.frame()
for (i in levels(factor(df$a))) {
  # collapse the group's b values into one string, e.g. "yzzz"
  temp <- paste(df$b[df$a == i], collapse = "")
  # drop every z when the group contains "yz" and ends in z
  if (str_detect(temp, "yz") && str_detect(temp, "z$")) {
    temp <- gsub("z", "", temp)
  }
  # split the string back into one row per character
  chars <- strsplit(temp, "")[[1]]
  df2 <- rbind(df2, data.frame(a = rep(i, length(chars)), b = chars))
}
df2
# a b
# 1 A x
# 2 A y
# 3 B x
# 4 B x
# 5 B z
# 6 C y
# 7 D y
# 8 D z
# 9 D z
# 10 D z
# 11 D x
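If the values really are single characters, as in the example, the string idea can be narrowed to the trailing run only with one anchored regular expression. A rough sketch (strip_trailing is a made-up name, and it returns the remaining values rather than row indices):

```r
# Hypothetical helper: remove a trailing run of "z" that directly follows a "y"
# Assumes every value in b is exactly one character long
strip_trailing <- function(b) {
  s <- paste(b, collapse = "")
  s <- sub("yz+$", "y", s)   # only matches y followed by z's at the very end
  strsplit(s, "")[[1]]
}

strip_trailing(c("y", "z", "z", "z"))       # group C
strip_trailing(c("x", "x", "z"))            # group B, no y before the z
strip_trailing(c("y", "z", "z", "z", "x"))  # group D, ends in x
```

Unlike gsub("z", "", ...), this leaves any z that appears earlier in the group alone.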

Related

R collapse duplicate pairs (in any order) across dataframe columns and edit 3rd column?

I used rbind to join 2 dataframes, with a column denoting its source, resulting in
from | to | source
1 A B X
2 C D Y
3 B A Y
...
I would like to look for overlapping pairs, regardless of "order", combine those pairs, then edit the source column to something else, e.g. "Z".
In the above example, rows 1 and 3 would be flagged as overlapping, so they will be combined and modified.
So the desired output would look something like
from | to | source
1 A B Z
2 C D Y
...
How can this be done?

You can try the code below
unique(
transform(
transform(
df,
from = pmin(from, to),
to = pmax(from, to)
),
source = ave(source, from, to, FUN = function(x) ifelse(length(x) > 1, "Z", x))
)
)
which gives
from to source
1 A B Z
2 C D Y

Example
set.seed(1)
df=data.frame(
"from"=sample(LETTERS[1:4],10,replace=T),
"to"=sample(LETTERS[1:4],10,replace=T),
"source"=sample(c("X","Y"),10,replace=T)
)
from to source
1 A C X
2 D C X
3 C A X
4 A A X
5 B A X
6 A B X
7 C B Y
8 C B X
9 B B X
10 B C Y
and then
tmp=t(
apply(df,1,function(x){
sort(x[1:2])
})
)
t1=duplicated(tmp,fromLast=F)
t2=duplicated(tmp,fromLast=T)
df[t2,"source"]="Z"
df[!t1,]
from to source
1 A C Z
2 D C X
4 A A X
5 B A Z
7 C B Z
9 B B X

Flag based on multiple conditions

This is my initial dataset:
x <- c("a","a","b","b","c","c","d","d")
y <- c("a","a","a","b","c","c", "d", "d")
z <- c(5,1,2,6,1,1,5,6)
df <- data.frame(x,y,z)
I am trying to create a column in the data frame that flags whether the following condition holds:
There is a row in the dataset with the same "x" and "y" values, and at least one of the rows with that "x" and "y" has a "z" value >= 5.
With the example provided, the output should be:
x y z flag
1 a a 5 TRUE
2 a a 1 TRUE
3 b a 2 FALSE
4 b b 6 TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5 TRUE
8 d d 6 TRUE
Thank you!

I use data.table package for all my aggregations. With this package I would do the following:
library(data.table)
dt <- as.data.table(df)
# by=.(x, y): grouping by x and y
# find all cases where
# 1. the maximum z value is >= 5
# 2. there is more than one entry for that (x, y) combo. .N is data.table syntax for the number of rows in that group
# := is a data.table syntax to assign back in to the original data.table
dt[, flag := max(z) >= 5 & .N > 1, by=.(x, y)]
# Does x need to equal y? If so use this
dt[, flag := max(z) >= 5 & .N > 1 & x == y, by=.(x, y)]
# view the result
dt[]
# return back to df
df <- as.data.frame(dt)
df

You can try the code below
> within(df, flag <- x==y & z>=5)
x y z flag
1 a a 5 TRUE
2 a a 1 FALSE
3 b a 2 FALSE
4 b b 6 TRUE
5 c c 1 FALSE
6 c c 1 FALSE
7 d d 5 TRUE
8 d d 6 TRUE
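Another reading of the question is "flag every row whose (x, y) group contains at least one z >= 5", with no requirement on the group size or on x equalling y. Under that assumption, a base R sketch with ave gives the flag column listed in the question:

```r
x <- c("a","a","b","b","c","c","d","d")
y <- c("a","a","a","b","c","c","d","d")
z <- c(5,1,2,6,1,1,5,6)
df <- data.frame(x, y, z)

# One key per (x, y) combination
key <- paste(df$x, df$y)

# any() is evaluated per group and recycled to every row of that group
df$flag <- as.logical(ave(df$z >= 5, key, FUN = any))
df
```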

How to efficiently swap elements in a dataframe?

Suppose that we have the following dataframe:
set.seed(1)
(tmp <- data.frame(x = 1:10, R1 = sample(LETTERS[1:5], 10, replace =
TRUE), R2 = sample(LETTERS[1:5], 10, replace = TRUE)))
x R1 R2
1 1 B B
2 2 B A
3 3 C D
4 4 E B
5 5 B D
6 6 E C
7 7 E D
8 8 D E
9 9 D B
10 10 A D
I want to do the following: if the difference between the level index
of factor R1 and that of factor R2 is an odd number, the levels of the
two factors need to be switched between them, which can be performed
through the following code:
for(ii in 1:dim(tmp)[1]) {
kk <- which(levels(tmp$R2) %in% tmp[ii,'R2'], arr.ind = TRUE) -
which(levels(tmp$R1) %in% tmp[ii,'R1'], arr.ind = TRUE)
if(kk%%2!=0) { # swap the values between the two factors
qq <- tmp[ii,]$R1
tmp[ii,]$R1 <- tmp[ii,]$R2
tmp[ii,]$R2 <- qq
}
}
More concise and efficient ways to achieve this?
P.S. A slightly different situation is the following.
set.seed(1)
(tmp <- data.frame(x = 1:10, R1 = sample(LETTERS[1:5], 10, replace =
TRUE), R2 = sample(LETTERS[2:6], 10, replace = TRUE)))
x R1 R2
1 C B
2 B B
3 C E
4 E C
5 E B
6 D E
7 E E
8 D F
9 C D
10 A E
Notice that the factor levels between the two factors, R1 and R2, slide by one level; that is, factor R1 does not have level F while factor R2 does not have level A. I want to swap the factor levels based on the combined levels of the two factors as shown below:
tl <- unique(c(levels(tmp$R1), levels(tmp$R2)))
for(ii in 1:dim(tmp)[1]) {
kk <- which(tl %in% tmp[ii,'R2'], arr.ind = TRUE) - which(tl %in%
tmp[ii,'R1'], arr.ind = TRUE)
if(kk%%2!=0) { # swap the values between the two factors
qq <- tmp[ii,]$R1
tmp[ii,]$R1 <- tmp[ii,]$R2
tmp[ii,]$R2 <- qq
}
}
How to go about this case? Thanks!

#Find out the indices where difference is odd
inds = abs(as.numeric(tmp$R1) - as.numeric(tmp$R2)) %% 2 != 0
#create new columns where values for the appropriate inds are from relevant columns
tmp$R1_new = replace(tmp$R1, inds, tmp$R2[inds])
tmp$R2_new = replace(tmp$R2, inds, tmp$R1[inds])
tmp
# x R1 R2 R1_new R2_new
#1 1 B B B B
#2 2 B A A B
#3 3 C D D C
#4 4 E B B E
#5 5 B D B D
#6 6 E C E C
#7 7 E D D E
#8 8 D E E D
#9 9 D B D B
#10 10 A D D A
Delete the old R1 and R2 if necessary
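If you would rather swap in place than create new columns, a sketch along these lines should work, assuming both factors share the same levels (here a small hand-made data frame rather than the sampled one, so the result does not depend on the random number generator):

```r
lv <- LETTERS[1:5]
tmp <- data.frame(x = 1:4,
                  R1 = factor(c("B", "B", "C", "E"), levels = lv),
                  R2 = factor(c("B", "A", "D", "C"), levels = lv))

# Rows where the level indices differ by an odd number
odd <- (as.integer(tmp$R1) - as.integer(tmp$R2)) %% 2 != 0

# Data frame assignment is positional, so this swaps the two columns on those rows
tmp[odd, c("R1", "R2")] <- tmp[odd, c("R2", "R1")]
tmp
```

For the second case in the question, both columns would first need to be re-levelled to the combined level set, otherwise the integer codes are not comparable.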

A solution using dplyr. dt is the final output. Notice that we need to use if_else from dplyr here, not the common ifelse from base R.
library(dplyr)
dt <- tmp %>%
mutate(R1_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R2, R1),
R2_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R1, R2)) %>%
select(x, R1 = R1_new, R2 = R2_new)
Update
For the updated case, add one mutate call to redefine the factor level of R1 and R2. The rest is the same.
tl <- unique(c(levels(tmp$R1), levels(tmp$R2)))
dt <- tmp %>%
mutate(R1 = factor(R1, levels = tl), R2 = factor(R2, levels = tl)) %>%
mutate(R1_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R2, R1),
R2_new = if_else((as.numeric(R2) - as.numeric(R1)) %% 2 != 0, R1, R2)) %>%
select(x, R1 = R1_new, R2 = R2_new)

Here is an option using data.table
library(data.table)
setDT(tmp)[(as.integer(R1) - as.integer(R2))%%2 != 0, c('R2', 'R1') := .(R1, R2)]
tmp
# x R1 R2
#1: 1 B B
#2: 2 A B
#3: 3 D C
#4: 4 B E
#5: 5 B D
#6: 6 E C
#7: 7 D E
#8: 8 E D
#9: 9 D B
#10:10 D A

data.table: “group counter” for a specific combination of columns

I would like to add a counter column to a data frame based on sets of identical rows. To do this, I used the data.table package. In my case, rows should be compared on the combination of columns "z" AND ("x" OR "y").
I tested:
DF[ , Index := .GRP, by = c("x","y","z") ]
but the result is the combination of "z" AND "x" AND "y".
How can I have the combination of "z" AND ("x" OR "y") ?
Here is a data example:
DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
DF <- data.table(DF)
I would like to have this output:
> DF
x y z Index
1: a 1 M 1
2: a 3 M 1
3: a 2 M 1
4: b 8 F 2
5: c 8 F 2
6: d 4 M 3
7: e 4 M 3
8: f 6 F 4
9: f 0 F 4

The new group starts if the value for z is changing or the values both for x and y are changing.
Try this example.
require(data.table)
DF <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
y = c(1,3,2,8,8,4,4,6,0),
z=c("M","M","M","F","F","M","M","F","F"))
# Helper: TRUE where a value differs from the previous one (the first element is always TRUE)
is.not.eq.with.lag <- function(x) c(TRUE, tail(x, -1) != head(x, -1))
DF[, x1 := is.not.eq.with.lag(x)]
DF[, y1 := is.not.eq.with.lag(y)]
DF[, z1 := is.not.eq.with.lag(z)]
DF
DF[, Index := cumsum(z1 | (x1 & y1))]
DF

I know a lot of people warn against for loops in R, but in this instance I think a loop is a very direct way of approaching the problem. Plus, the result isn't growing in size, so performance shouldn't be much of a concern. The for loop approach would be:
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
if (i == 1){
dt$grp[i] = 1
}
else {
if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
dt$grp[i] = dt$grp[i-1]
}else{
dt$grp[i] = dt$grp[i-1] + 1
}
}
}
Trying this on the OP's original problem, the result is:
DF = data.frame(x=c("a","a","a","b","c","d","e","f","f"), y=c(1,3,2,8,8,4,4,6,0), z=c("M","M","M","F","F","M","M","F","F"))
dt <- data.table(DF)
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
if (i == 1){
dt$grp[i] = 1
}
else {
if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
dt$grp[i] = dt$grp[i-1]
}else{
dt$grp[i] = dt$grp[i-1] + 1
}
}
}
dt
x y z grp
1: a 1 M 1
2: a 3 M 1
3: a 2 M 1
4: b 8 F 2
5: c 8 F 2
6: d 4 M 3
7: e 4 M 3
8: f 6 F 4
9: f 0 F 4
Trying this on the data.table in @Frank's comment gives the expected result as well:
dt<-data.table(x = c("b", "a", "a"), y = c(1, 1, 2), z = c("F", "F", "F"))
dt$grp <- rep(NA,nrow(dt))
for (i in 1:nrow(dt)){
if (i == 1){
dt$grp[i] = 1
}
else {
if(dt$z[i-1] == dt$z[i] & (dt$x[i-1] == dt$x[i] | dt$y[i-1] == dt$y[i])){
dt$grp[i] = dt$grp[i-1]
}else{
dt$grp[i] = dt$grp[i-1] + 1
}
}
}
dt
x y z grp
1: b 1 F 1
2: a 1 F 1
3: a 2 F 1

EDITED TO ADD: This solution is in some ways a more verbose version of the one advocated by djhurio above. I think it shows what is happening a bit more so I'll leave it.
I think this is a task easier to do if it is broken down a little bit. The below code creates TWO indices at first, one for changes in x (nested in z) and one for changes in y (nested in z). We then find the first row from each of these indices. Taking the cumulative sum of the case where both FIRST.x and FIRST.y is true should give your desired index.
library(data.table)
dt_example <- data.table(x = c("a","a","a","b","c","d","e","f","f"),
y = c(1,3,2,8,8,4,4,6,0),
z = c("M","M","M","F","F","M","M","F","F"))
dt_example[,Index_x := .GRP,by = c("z","x")]
dt_example[,Index_y := .GRP,by = c("z","y")]
dt_example[,FIRST.x := !duplicated(Index_x)]
dt_example[,FIRST.y := !duplicated(Index_y)]
dt_example[,Index := cumsum(FIRST.x & FIRST.y)]
dt_example
x y z Index_x Index_y FIRST.x FIRST.y Index
1: a 1 M 1 1 TRUE TRUE 1
2: a 3 M 1 2 FALSE TRUE 1
3: a 2 M 1 3 FALSE TRUE 1
4: b 8 F 2 4 TRUE TRUE 2
5: c 8 F 3 4 TRUE FALSE 2
6: d 4 M 4 5 TRUE TRUE 3
7: e 4 M 5 5 TRUE FALSE 3
8: f 6 F 6 6 TRUE TRUE 4
9: f 0 F 6 7 FALSE TRUE 4

This approach looks for changes in x & z | y & z. The extra columns are left in the data.table to show the calculations.
DF[, c("Ix", "Iy", "Iz", "dx", "dy", "min.change", "Index") :=
#Create index of values based on consecutive order
list(ix <- rleid(x), iy <- rleid(y), iz <- rleid(z),
#Determine if combinations of x+z OR y+z change
ix1 <- c(0, diff(rleid(ix+iz))),
iy1 <- c(0, diff(rleid(iy+iz))),
#Either combination is constant (no change)?
change <- pmin(ix1, iy1),
#New index based on change
cumsum(change) + 1
)]
x y z Ix Iy Iz dx dy min.change Index
1: a 1 M 1 1 1 0 0 0 1
2: a 3 M 1 2 1 0 1 0 1
3: a 2 M 1 3 1 0 1 0 1
4: b 8 F 2 4 2 1 1 1 2
5: c 8 F 3 4 2 1 0 0 2
6: d 4 M 4 5 3 1 1 1 3
7: e 4 M 5 5 3 1 0 0 3
8: f 6 F 6 6 4 1 1 1 4
9: f 0 F 6 7 4 0 1 0 4

Transforming data.frame in R

I have the following data frame:
foo <- data.frame( abs( cbind(rnorm(3),rnorm(3, mean=.8),rnorm(3, mean=.9),rnorm(3, mean=1))))
colnames(foo) <- c("w","x","y","z")
rownames(foo) <- c("n","q","r")
foo
# w x y z
# n 1.51550092 1.4337572 1.2791624 1.1771230
# q 0.09977303 0.8173761 1.6123402 0.1510737
# r 1.17083866 1.2469347 0.8712135 0.8488029
What I want to do is to change it into :
newdf
# 1 n w 1.51550092
# 2 q w 0.09977303
# 3 r w 1.17083866
# 4 n x 1.43375725
# 5 q x 0.81737606
# 6 r x 1.24693468
# 7 n y 1.27916241
# 8 q y 1.61234016
# 9 r y 0.87121353
# 10 n z 1.17712302
# 11 q z 0.15107369
# 12 r z 0.84880292
What's the way to do it?

There are several ways to do this. Here's one:
set.seed(1)
foo <- data.frame( abs( cbind(rnorm(3),
rnorm(3, mean=.8),
rnorm(3, mean=.9),
rnorm(3, mean=1))))
colnames(foo) <- c("w","x","y","z")
rownames(foo) <- c("n","q","r")
foo
# w x y z
# n 0.6264538 2.39528080 1.387429 0.6946116
# q 0.1836433 1.12950777 1.638325 2.5117812
# r 0.8356286 0.02046838 1.475781 1.3898432
data.frame(rows = row.names(foo), stack(foo))
# rows values ind
# 1 n 0.62645381 w
# 2 q 0.18364332 w
# 3 r 0.83562861 w
# 4 n 2.39528080 x
# 5 q 1.12950777 x
# 6 r 0.02046838 x
# 7 n 1.38742905 y
# 8 q 1.63832471 y
# 9 r 1.47578135 y
# 10 n 0.69461161 z
# 11 q 2.51178117 z
# 12 r 1.38984324 z

reshape2::melt() is particularly well suited to this transformation:
library(reshape2)
foo <- cbind(ID=rownames(foo), foo)
melt(foo)
# Using ID as id variables
# ID variable value
# 1 n w 1.7337416
# 2 q w 0.5890877
# 3 r w 0.2245508
# 4 n x 0.5237346
# 5 q x 0.9320455
# 6 r x 0.8156573
# 7 n y 1.9287306
# 8 q y 1.1604229
# 9 r y 1.7631215
# 10 n z 0.3591350
# 11 q z 0.9740170
# 12 r z 0.5621968
