I have this example data.frame:
df <- data.frame(a = c(1,2,3,5,7,8),b=c(2,3,4,6,8,9))
And I'd like to collapse every row i whose b value equals the a value of the subsequent row (i+1), so that the collapsed row takes its a value from row i and its b value from row i+1. This should be applied repeatedly, until no consecutive rows meet the condition.
For the example df rows 1-3 are to be collapsed, row 4 left as is, and then rows 5-6 collapsed, giving:
res.df <- data.frame(a = c(1,5,7), b = c(4,6,9))
This isn't overly pretty, but it is vectorised: it compares a cut-down version of df$a to df$b.
grps <- rev(cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))))
#[1] 3 3 3 2 1 1
cbind(df["a"], b=ave(df$b,grps,FUN=max) )[!duplicated(grps),]
# a b
#1 1 4
#4 5 6
#5 7 9
Breaking it down probably helps explain the first part:
tail(df$a,-1) != head(df$b,-1)
#[1] FALSE FALSE TRUE TRUE FALSE
c(tail(df$a,-1) != head(df$b,-1),TRUE)
#[1] FALSE FALSE TRUE TRUE FALSE TRUE
rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE)))
#[1] 1 1 2 3 3 3
I have those two df's:
ID1 <- c("TRZ00897", "AAR9832", "NZU44447683209", "sxc89898989M", "RSU765th89", "FFF")
Date1 <- c("2022-08-21","2022-03-22","2022-09-24", "2022-09-21", "2022-09-22", "2022-09-22")
Data1 <- data.frame(ID1,Date1)
ID <- c("RSU765th89", "NZU44447683209", "AAR9832", "TRZ00897","ERD895655", "FFF", "IUHG0" )
Date <- c("2022-09-22","2022-09-21", "2022-03-22", "2022-08-21", "2022-09-21", "2022-09-22", "2022-09-22" )
Data2 <- data.frame(ID,Date)
I tried to get exact matches. An exact match means that both ID and Date are the same in the two df's; for example, "TRZ00897" / "2022-08-21" is an exact match because that pair is present in both df's.
With the following line of code:
match(Data1$ID1, Data2$ID) == match(Data1$Date1, Data2$Date)
the output is:
TRUE TRUE NA NA TRUE FALSE
Obviously the last one should not be FALSE, because "FFF" / "2022-09-22" is in both df's. The reason it is FALSE is that the date "2022-09-22" already occurs in Data2 at index position 1.
match(Data1$ID1, Data2$ID)
4 3 2 NA 1 6
match(Data1$Date1, Data2$Date)
4 3 NA 2 1 1
So in the end, index positions 6 and 1 are compared; they are not equal --> FALSE.
How can I change this? Which function should I use to get the correct answer?
Note, I don't need to merge or join etc. I'm really looking for a function that can detect those patterns.
Combine the columns then match:
match(paste(Data1$ID1, Data1$Date1), paste(Data2$ID, Data2$Date))
# [1] 4 3 NA NA 1 6
To get logical output use %in%:
paste(Data1$ID1, Data1$Date1) %in% paste(Data2$ID, Data2$Date)
# [1] TRUE TRUE FALSE FALSE TRUE TRUE
Try match with asplit (since the two data frames have different column names, the names have to be removed manually with unname; this could be avoided if both had the same names):
> match(asplit(unname(Data1), 1), asplit(unname(Data2), 1))
[1] 4 3 NA NA 1 6
Another, more memory-expensive, option is using interaction:
> match(interaction(Data1), interaction(Data2))
[1] 4 3 NA NA 1 6
With mapply and %in%:
apply(mapply(`%in%`, Data1, Data2), 1, all)
[1] TRUE TRUE FALSE FALSE TRUE TRUE
Or, equivalently, with rowSums:
rowSums(mapply(`%in%`, Data1, Data2)) == ncol(Data1)
Edit: for a subset of columns:
idx <- c(1, 2)
apply(mapply(`%in%`, Data1[idx], Data2[idx]), 1, all)
#[1] TRUE TRUE FALSE FALSE TRUE TRUE
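One caveat worth noting: mapply(`%in%`, ...) checks each column's membership independently, so a row can come out TRUE even when its ID and Date occur in different rows of Data2; for the example data the results happen to coincide with the exact-match approaches. A small hypothetical data frame d1 (made up for illustration) shows the difference:
# "AAR9832" and "2022-08-21" both occur somewhere in Data2, but never in the same row
d1 <- data.frame(ID1 = c("AAR9832", "nomatch"),
                 Date1 = c("2022-08-21", "2022-08-21"))
apply(mapply(`%in%`, d1, Data2), 1, all)                  # column-wise membership
# [1]  TRUE FALSE
paste(d1$ID1, d1$Date1) %in% paste(Data2$ID, Data2$Date)  # row-wise pairs
# [1] FALSE FALSE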
Is there a way to split a column into multiple new columns based on the character values it contains, so that, for example, starting with this data frame
initialData = data.frame(attr = c('a','b','c','d'), type=c('1,2','2','3','2,3'))
And the endData is something like this:
attr Conditions Cond1 Cond2 Cond3
1 a 1,2 TRUE TRUE FALSE
2 b 2 FALSE TRUE FALSE
3 c 3 FALSE FALSE TRUE
4 d 2,3 FALSE TRUE TRUE
I've written a function that takes in a character, runs a regexp on it to see if the condition is met and then returns TRUE or FALSE, but I'm not sure how to go through each row in the data frame and add the result to the correct column.
We can use mtabulate from qdapTools after splitting the 'type' column with strsplit, and then cbind the result with the original dataset:
library(qdapTools)
out <- cbind(initialData,
mtabulate(strsplit(as.character(initialData$type), ",")) > 0)
names(out)[3:5] <- paste0("Cond", names(out)[3:5])
out
# attr type Cond1 Cond2 Cond3
#1 a 1,2 TRUE TRUE FALSE
#2 b 2 FALSE TRUE FALSE
#3 c 3 FALSE FALSE TRUE
#4 d 2,3 FALSE TRUE TRUE
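If adding a package is not desirable, a minimal base-R sketch (assuming 'type' always holds comma-separated codes) could look like this:
splits <- strsplit(as.character(initialData$type), ",")
conds <- sort(unique(unlist(splits)))
out2 <- initialData
for (cc in conds) {
  # TRUE when the code cc appears among this row's comma-separated values
  out2[[paste0("Cond", cc)]] <- vapply(splits, function(v) cc %in% v, logical(1))
}
out2
#  attr type Cond1 Cond2 Cond3
#1    a  1,2  TRUE  TRUE FALSE
#2    b    2 FALSE  TRUE FALSE
#3    c    3 FALSE FALSE  TRUE
#4    d  2,3 FALSE  TRUE  TRUE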
I have an example dataset looks like:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable that is TRUE when the category variable contains "theta". However, I would like to assign FALSE when the cell value contains "X1" or "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, all cell values that contain "theta" get the logical value TRUE.
Then, I wrote this below to just assign "FALSE" when the cell value has "X" in it.
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the previous result.
What I would like to get is to combine two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks
We can create 'logic' by detecting the substring 'theta' at the end and requiring that the starting (^) character is not 'X' ([^X])
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(logic = str_detect(category, "^[^X].*theta$"))
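Assuming data still has only the category column, this should print the requested logic values:
#  category logic
#1        A FALSE
#2        B FALSE
#3        C FALSE
#4 X1_theta FALSE
#5 X2_theta FALSE
#6 AB_theta  TRUE
#7 BC_theta  TRUE
#8 CD_theta  TRUE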
If we need to split the column into separate columns based on the conditions
data %>%
mutate(logic = str_detect(category, "^[^X].*theta$"),
category = case_when(logic ~ str_replace(category, "_", ","),
TRUE ~ as.character(category))) %>%
separate(category, into = c("split1", "split2"), sep= ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to have two grepl condition statements
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
Not the cleanest in the world (since it adds 2 unnecessary cols) but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!
I've started learning R and got a piece of code containing this statement:
if (sum(C == C[i]) == 1) # C is simply a vector, and i is the index of a value in this vector, which the user specifies as an argument.
How can you pass a conditional statement as an argument to a function? Also, what does this statement mean?
Thank you.
Let's take an example to understand this.
Consider C as the numeric vector 1 to 10 and take i as 3:
C <- 1:10
i <- 3
So when we do
C == C[i]
#[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
it compares every element of C with C[i], which is 3, and returns a logical vector that is TRUE only at the 3rd index.
When we sum this logical vector, we get the count of TRUE values (FALSE counts as 0 and TRUE as 1), which in this case is 1:
sum(C == C[i])
#[1] 1
This count is then compared with 1 to make sure that C[i] occurs only once in C:
sum(C == C[i]) == 1
#[1] TRUE
The check returns FALSE if C[i] is repeated in C. For example,
C <- c(1:10, 3) #Adding an extra 3 in the end
C
#[1] 1 2 3 4 5 6 7 8 9 10 3
i <- 3
sum(C == C[i]) == 1
#[1] FALSE
The bottom line: the condition is TRUE only if C[i] occurs exactly once in C.
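Just as a possible alternative (not from the original code), the same test can be written by asking whether C[i] appears anywhere else in C:
!(C[i] %in% C[-i])
#[1] FALSE
With the repeated-3 vector above this is again FALSE, and it would be TRUE for the original 1:10.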
I've noticed a couple of times now that when I'm using R to identify duplicates, sometimes it seems to identify the wrong cases.
Here's a data frame that has three columns, each which may be holding duplicate values. I want to isolate the cases that are duplicates of another case on all three variables.
set.seed(100)
test <- data.frame(id = sample(1:15, 20, replace = TRUE),
cat1 = sample(letters[1:2], 20, replace = TRUE),
cat2 = sample(letters[1:2], 20, replace = TRUE))
Which gives me:
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
4 1 b b
5 8 a b
6 8 a a
7 13 b b
8 6 b b
9 9 b a
10 3 a a
11 10 a a
12 14 b a
13 5 a a
14 6 b a
15 12 b b
16 11 b a
17 4 a a
18 6 b a
19 6 b b
20 11 a a
I've tried this a couple of ways, such as:
duplicated(test$id) & duplicated(test$cat1) & duplicated(test$cat2)
But this just gives the same result as duplicated(test$id):
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[17] TRUE TRUE TRUE TRUE
So instead I tried duplicated(test$id, test$cat1, test$cat2), which produces different results:
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
[17] FALSE TRUE FALSE FALSE
But this is still incorrect - if I pull these cases from the data frame we get:
> test[which(duplicated(test$id, test$cat1, test$cat2)),]
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
5 8 a b
8 6 b b
14 6 b a
16 11 b a
18 6 b a
As you can see, these are not the rows we should be getting (if it did what I thought it would do); as far as I can see, the result should be:
18 6 b a
19 6 b b
Does anyone know why it's coming up with these results, and where I'm going wrong using it? Is there a simple (ideally non-verbose) way of doing this?
We need to apply duplicated on a data.frame or matrix or vector
i1 <- duplicated(test[c('id', 'cat1')])
i2 <- duplicated(cbind(test$id, test$cat1))
identical(i1, i2)
#[1] TRUE
and not on more than one data.frame or matrix or vector
i3 <- duplicated(test$id, test$cat1)
identical(i1, i3)
#[1] FALSE
This is specified in the documentation, ?duplicated:
duplicated(x, incomparables = FALSE, ...)
where
x a vector or a data frame or an array or NULL.
and not 'x1', 'x2', etc.
As #Aaron mentioned in the comments, to subset the duplicates from the OP's data
test[duplicated(test),]
and if we want all rows that have a duplicate, including the first occurrences, then
test[duplicated(test)|duplicated(test, fromLast = TRUE),]
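With the data frame as printed in the question, these should return:
test[duplicated(test),]
#   id cat1 cat2
#18  6    b    a
#19  6    b    b
test[duplicated(test)|duplicated(test, fromLast = TRUE),]
#   id cat1 cat2
#8   6    b    b
#14  6    b    a
#18  6    b    a
#19  6    b    b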
Taking duplicates of columns separately is not the same as taking duplicates of a data frame or matrix. This example makes it clearer:
df = data.frame(x = c(1,2,1),
y = c(1,3,3))
df$dupe = duplicated(df$x) & duplicated(df$y)
df$dupe2 = duplicated(df[,c("x","y")])
df
Using your method, duplicated says "When I hit the third row, x already had a 1 so it's duplicated. y already had a 3 so it's duplicated." This doesn't mean that it already saw a row where x = 1 and y = 3.
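Printing df makes the difference concrete: the column-wise dupe flags row 3, while dupe2, computed on the data frame, does not:
#  x y  dupe dupe2
#1 1 1 FALSE FALSE
#2 2 3 FALSE FALSE
#3 1 3  TRUE FALSE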