R: duplicated() seems to select the wrong duplicates

I've noticed a couple of times now that when I'm using R to identify duplicates, sometimes it seems to identify the wrong cases.
Here's a data frame that has three columns, each of which may hold duplicate values. I want to isolate the cases that are duplicates of another case on all three variables.
set.seed(100)
test <- data.frame(id = sample(1:15, 20, replace = TRUE),
cat1 = sample(letters[1:2], 20, replace = TRUE),
cat2 = sample(letters[1:2], 20, replace = TRUE))
Which gives me:
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
4 1 b b
5 8 a b
6 8 a a
7 13 b b
8 6 b b
9 9 b a
10 3 a a
11 10 a a
12 14 b a
13 5 a a
14 6 b a
15 12 b b
16 11 b a
17 4 a a
18 6 b a
19 6 b b
20 11 a a
I've tried this a couple of ways, such as:
duplicated(test$id) & duplicated(test$cat1) & duplicated(test$cat2)
But this just results in the same as duplicated(test$id):
[1] FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE FALSE
[17] TRUE TRUE TRUE TRUE
So instead I tried duplicated(test$id, test$cat1, test$cat2), which produces different results:
[1] TRUE TRUE TRUE FALSE TRUE FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE
[17] FALSE TRUE FALSE FALSE
But is still incorrect - if I call these cases from the data frame we get:
> test[which(duplicated(test$id, test$cat1, test$cat2)),]
id cat1 cat2
1 5 b a
2 4 b b
3 9 b b
5 8 a b
8 6 b b
14 6 b a
16 11 b a
18 6 b a
As you can see these are not the rows we should be getting (were it doing what I'd have thought it would do), which should be (as far as I can see):
18 6 b a
19 6 b b
Does anyone know why it's coming up with these results, and where I'm going wrong using it? Is there a simple (ideally non-verbose) way of doing this?

We need to apply duplicated on a data.frame or matrix or vector
i1 <- duplicated(test[c('id', 'cat1')])
i2 <- duplicated(cbind(test$id, test$cat1))
identical(i1, i2)
#[1] TRUE
and not on more than one data.frame or matrix or vector
i3 <- duplicated(test$id, test$cat1)
identical(i1, i3)
#[1] FALSE
This is specified in the documentation, ?duplicated
duplicated(x, incomparables = FALSE, ...)
where
x a vector or a data frame or an array or NULL.
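and not 'x1', 'x2', etc. Given that signature, a second positional argument is matched to incomparables rather than treated as another column to compare.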
As @Aaron mentioned in the comments, to subset the duplicates from the OP's data
test[duplicated(test),]
and if we want every row that has a duplicate, including its first occurrence, then
test[duplicated(test)|duplicated(test, fromLast = TRUE),]
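A minimal sketch of the difference between those two calls, using a small hypothetical data frame (`toy` and its values are made up for illustration, not taken from the question):

```r
# Hypothetical toy data: rows 1 and 3 match on all three columns
toy <- data.frame(id   = c(6, 6, 6, 4),
                  cat1 = c("b", "b", "b", "a"),
                  cat2 = c("a", "b", "a", "a"))

# TRUE only for later repeats of a row that was already seen
dup_later <- duplicated(toy)

# TRUE for every row that has a twin, first occurrence included
dup_all <- duplicated(toy) | duplicated(toy, fromLast = TRUE)

toy[dup_later, ]  # row 3 only
toy[dup_all, ]    # rows 1 and 3
```

Which form is appropriate depends on whether the first occurrence should count as a duplicate.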

Taking duplicates of columns separately is not the same as taking duplicates of a data frame or matrix. This example makes it more clear:
df = data.frame(x = c(1,2,1),
y = c(1,3,3))
df$dupe = duplicated(df$x) & duplicated(df$y)
df$dupe2 = duplicated(df[,c("x","y")])
df
Using your method, duplicated says "When I hit the third row, x already had a 1 so it's duplicated. y already had a 3 so it's duplicated." This doesn't mean that it already saw a row where x = 1 and y = 3.
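Running the example makes the mismatch visible. The sketch below repeats it with the two results kept as separate vectors (`by_column` and `by_row` are names chosen here for illustration):

```r
df <- data.frame(x = c(1, 2, 1),
                 y = c(1, 3, 3))

# Column-wise: each column is checked on its own, then the results are ANDed
by_column <- duplicated(df$x) & duplicated(df$y)

# Row-wise: a row counts only if the whole (x, y) pair was seen before
by_row <- duplicated(df[, c("x", "y")])

by_column  # FALSE FALSE  TRUE
by_row     # FALSE FALSE FALSE
```

The third row is flagged column-wise even though the pair (1, 3) never appeared earlier.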

Related

Recoding based on two conditions in R

I have an example dataset that looks like:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
> data
category
1 A
2 B
3 C
4 X1_theta
5 X2_theta
6 AB_theta
7 BC_theta
8 CD_theta
I am trying to generate a logical variable that is TRUE when the category (variable) contains "theta". However, I would like the logical value to be FALSE when the cell value contains "X1" or "X2".
Here is what I did:
data$logic <- str_detect(data$category, "theta")
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta TRUE
5 X2_theta TRUE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Here, all cell values that contain "theta" get the logical value TRUE.
Then, I wrote this below to just assign "FALSE" when the cell value has "X" in it.
data$logic <- ifelse(grepl("X", data$category), "FALSE", "TRUE")
> data
category logic
1 A TRUE
2 B TRUE
3 C TRUE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
But this, of course, overwrote the previous application
What I would like to get is to combine two conditions:
> data
category logic
1 A FALSE
2 B FALSE
3 C FALSE
4 X1_theta FALSE
5 X2_theta FALSE
6 AB_theta TRUE
7 BC_theta TRUE
8 CD_theta TRUE
Any thoughts?
Thanks
We can create the 'logic', by detecting substring 'theta' at the end and not having 'X' ([^X]) as the starting (^) character
library(dplyr)
library(stringr)
library(tidyr)
data %>%
mutate(logic = str_detect(category, "^[^X].*theta$"))
If we need to split the column into separate columns based on the conditions
data %>%
mutate(logic = str_detect(category, "^[^X].*theta$"),
category = case_when(logic ~ str_replace(category, "_", ","),
TRUE ~ as.character(category))) %>%
separate(category, into = c("split1", "split2"), sep= ",", remove = FALSE)
# category split1 split2 logic
#1 A A <NA> FALSE
#2 B B <NA> FALSE
#3 C C <NA> FALSE
#4 X1_theta X1_theta <NA> FALSE
#5 X2_theta X2_theta <NA> FALSE
#6 AB,theta AB theta TRUE
#7 BC,theta BC theta TRUE
#8 CD,theta CD theta TRUE
Or in base R
data$logic <- with(data, grepl("^[^X].*theta$", category))
Another option is to have two grepl condition statements
data$logic <- with(data, grepl("theta$", category) & !grepl("^X\\d+", category))
data$logic
#[1] FALSE FALSE FALSE FALSE FALSE TRUE TRUE TRUE
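The same two-condition idea can also be sketched without regular expressions. This is only one possible variant: it assumes "contains theta anywhere" and "starts with X" are the intended rules, and it uses base R's `startsWith()` (the names `has_theta` and `starts_x` are illustrative):

```r
category <- c("A", "B", "C", "X1_theta", "X2_theta",
              "AB_theta", "BC_theta", "CD_theta")

# Condition 1: the literal string "theta" appears somewhere in the value
has_theta <- grepl("theta", category, fixed = TRUE)

# Condition 2: the value begins with "X"
starts_x <- startsWith(category, "X")

# Keep TRUE only where condition 1 holds and condition 2 does not
logic <- has_theta & !starts_x
data.frame(category, logic)
```

Keeping the conditions as separate logical vectors makes it straightforward to combine them with `&`, `|`, and `!` instead of overwriting one result with the next.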
Not the cleanest in the world (since it adds 2 unnecessary cols) but it gets the job done:
data <- as.data.frame(c("A","B","C","X1_theta","X2_theta","AB_theta","BC_theta","CD_theta"))
colnames(data) <- "category"
data$logic1 <- ifelse(grepl('X',data$category), FALSE, TRUE)
data$logic2 <- ifelse(grepl('theta',data$category),TRUE, FALSE)
data$logic <- ifelse((data$logic1 == TRUE & data$logic2 == TRUE), TRUE, FALSE)
print(data)
I think you can also remove the logic1 and logic2 cols if you want but I usually don't bother (I'm a messy coder haha).
Hope this helped!
EDIT: akrun's grepl solution does what I'm doing way more cleanly (as in, it doesn't require the extra cols). I definitely recommend that approach!

R: How to make extra rows from a column?

I have a data set of human hands in which each person is currently a single observation. I want to reshape the data frame so that each hand is an individual observation. I tried something with the "dplyr" package and the "gather" function but had no success at all.
So from this, where each person is on one row :
id Gender Age Present_R Present_L Dominant
1 F 2 TRUE TRUE R
2 F 5 TRUE FALSE L
3 M 8 FALSE FALSE R
to this, where each hand is on one row:
id Gender Age Hand Present Dominant
1 F 2 R TRUE TRUE
2 F 2 L TRUE FALSE
3 F 5 R TRUE FALSE
4 F 5 L FALSE TRUE
5 M 8 R FALSE TRUE
6 M 8 L FALSE FALSE
Note that hand dominance becomes logical.
We can gather into 'long' format, arrange by 'id', then create 'Dominant' by unlisting the 'Present' columns and 'Hand' by removing the prefix up to the underscore from the 'Hand' column
library(tidyverse)
gather(df1, Hand, Present, Present_R:Present_L) %>%
arrange(id) %>%
mutate(Dominant = unlist(df1[c("Present_L", "Present_R")]),
id = row_number(),
Hand = str_remove(Hand, ".*_"))
# id Gender Age Dominant Hand Present
#1 1 F 2 TRUE R TRUE
#2 2 F 2 FALSE L TRUE
#3 3 F 5 FALSE R TRUE
#4 4 F 5 TRUE L FALSE
#5 5 M 8 TRUE R FALSE
#6 6 M 8 FALSE L FALSE
Based on the OP's comments, it seems like we need to compare 'Dominant' with 'Hand'
gather(df1, Hand, Present, Present_R:Present_L) %>%
arrange(id) %>%
mutate(id = row_number(),
Hand = str_remove(Hand, ".*_"),
Dominant = Dominant == Hand)
# id Gender Age Dominant Hand Present
#1 1 F 2 TRUE R TRUE
#2 2 F 2 FALSE L TRUE
#3 3 F 5 FALSE R TRUE
#4 4 F 5 TRUE L FALSE
#5 5 M 8 TRUE R FALSE
#6 6 M 8 FALSE L FALSE
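As a hedged alternative, the same reshape can be sketched with `tidyr::pivot_longer()`, which tidyr offers as the newer interface to `gather()`. The definition of `df1` below is an assumption reconstructed from the table in the question, since the original input object is not shown:

```r
library(dplyr)
library(tidyr)

# Assumed reconstruction of the question's input data
df1 <- data.frame(id = 1:3,
                  Gender = c("F", "F", "M"),
                  Age = c(2, 5, 8),
                  Present_R = c(TRUE, TRUE, FALSE),
                  Present_L = c(TRUE, FALSE, FALSE),
                  Dominant = c("R", "L", "R"))

long <- df1 %>%
  # One row per hand; strip the "Present_" prefix to leave "R" / "L"
  pivot_longer(c(Present_R, Present_L),
               names_to = "Hand",
               names_prefix = "Present_",
               values_to = "Present") %>%
  # Dominance becomes logical: is this row's hand the dominant one?
  mutate(Dominant = Dominant == Hand,
         id = row_number())
long
```

Rows come out person by person (R then L), so no separate `arrange()` step is needed in this sketch.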
With a small data frame (i.e., few variables, regardless of the number of cases), "hand-coding" may be the easiest approach:
with(df, data.frame(id = c(id,id), Gender=c(Gender,Gender), Age=c(Age, Age),
Hand = c(rep("R", nrow(df)), rep("L", nrow(df))),
Present = c(Present_R, Present_L),
Dominant = c(Dominant=="R", Dominant=="L")
))

Uncover values of one data frame with another mask data frame

Suppose I have two data frames A and B_mask, where
A <- as.data.frame( matrix(1:20,nrow=4) )
V1 V2 V3 V4 V5
1 1 5 9 13 17
2 2 6 10 14 18
3 3 7 11 15 19
4 4 8 12 16 20
And suppose also,
B_mask <- matrix(FALSE, nrow=4, ncol=5)
B_mask[2:3, 1:3] <- TRUE
B_mask <- as.data.frame(B_mask)
V1 V2 V3 V4 V5
1 FALSE FALSE FALSE FALSE FALSE
2 TRUE TRUE TRUE FALSE FALSE
3 TRUE TRUE TRUE FALSE FALSE
4 FALSE FALSE FALSE FALSE FALSE
How does one get a result data frame such that:
If an entry in B_mask is equal to TRUE, uncover the corresponding value in A? For example, because B_mask[2,1] = TRUE, I would want result[2,1] = A[2,1] = 2.
If an entry in B_mask is equal to FALSE, cover the corresponding value in A as NA? For example, because B_mask[3,4] = FALSE, I would want result[3,4] = NA.
Thanks!
We create a copy of the dataset ('res'), convert the 'FALSE' to NA in 'B_mask', and use the logical index to subset the corresponding values of 'A' and assign the output back to 'res' with the structure intact ([])
res <- A
res[] <- as.matrix(A)[as.logical(NA^!B_mask)]
Or, as @alexis_laz mentioned, this can also be done with
is.na(res) <- !as.matrix(B_mask)
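Another way to write the same masking, shown here only as a sketch, is to assign NA directly through a logical matrix index. It relies on data frames accepting a logical matrix of matching dimensions in `[<-`:

```r
A <- as.data.frame(matrix(1:20, nrow = 4))

B_mask <- matrix(FALSE, nrow = 4, ncol = 5)
B_mask[2:3, 1:3] <- TRUE
B_mask <- as.data.frame(B_mask)

# Copy A, then blank out every cell whose mask entry is FALSE
res <- A
res[!as.matrix(B_mask)] <- NA
res
```

Only the six cells in rows 2 to 3 and columns 1 to 3 keep their values; everything else becomes NA.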

Collapse consecutive rows in a data frame

I have this example data.frame:
df <- data.frame(a = c(1,2,3,5,7,8),b=c(2,3,4,6,8,9))
And I'd like to collapse every row i whose b value equals the a value of the next row (i+1), so that the collapsed row takes its a value from row i and its b value from row i+1. This should be repeated until no consecutive rows meet this condition.
For the example df rows 1-3 are to be collapsed, row 4 left as is, and then rows 5-6 collapsed, giving:
res.df <- data.frame(a = c(1,5,7), b = c(4,6,9))
This isn't overly pretty, but it is vectorised comparing a cutdown version of df$a to df$b.
grps <- rev(cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))))
#[1] 3 3 3 2 1 1
cbind(df["a"], b=ave(df$b,grps,FUN=max) )[!duplicated(grps),]
# a b
#1 1 4
#4 5 6
#5 7 9
Breaking it down probably helps explain the first part:
tail(df$a,-1) != head(df$b,-1)
#[1] FALSE FALSE TRUE TRUE FALSE
c(tail(df$a,-1) != head(df$b,-1),TRUE)
#[1] FALSE FALSE TRUE TRUE FALSE TRUE
rev(c(tail(df$a,-1) != head(df$b,-1),TRUE))
#[1] TRUE FALSE TRUE TRUE FALSE FALSE
cumsum(rev(c(tail(df$a,-1) != head(df$b,-1),TRUE)))
#[1] 1 1 2 3 3 3
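The same break points can also be used more directly. The sketch below is a variation, not the answer's method: it marks where each run ends and starts (`ends` and `starts` are illustrative names), then takes a from the first row of each run and b from the last, so it does not rely on b increasing within a run:

```r
df <- data.frame(a = c(1, 2, 3, 5, 7, 8), b = c(2, 3, 4, 6, 8, 9))

# A run ends where the next row's a differs from this row's b (and at the last row)
ends <- c(tail(df$a, -1) != head(df$b, -1), TRUE)

# A run starts at row 1 and right after every run end
starts <- c(TRUE, head(ends, -1))

# First a and last b of each run
res <- data.frame(a = df$a[starts], b = df$b[ends])
res
```

For the example data this gives the rows (1, 4), (5, 6), and (7, 9), matching the desired res.df.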

Reshape data frame to convert factors into columns in R

I have a data frame where one particular column has a set of specific values (let's say, 1, 2, ..., 23). What I would like to do is to convert from this layout to the one, where the frame would have extra 23 (in this case) columns, each one representing one of the factor values. The data in these columns would be booleans indicating whether a particular row had a given factor value... To show a specific example:
Source frame:
ID DATE SECTOR
123 2008-01-01 1
456 2008-01-01 3
789 2008-01-02 5
... <more records with SECTOR values from 1 to 5>
Desired format:
ID DATE SECTOR.1 SECTOR.2 SECTOR.3 SECTOR.4 SECTOR.5
123 2008-01-01 T F F F F
456 2008-01-01 F F T F F
789 2008-01-02 F F F F T
I have no problem doing it in a loop but I hoped there would be a better way. So far reshape() didn't yield the desired result. Help would be much appreciated.
I would try to bind another column called "value" and set value = TRUE.
df <- data.frame(cbind(1:10, 2:11, 1:3))
colnames(df) <- c("ID","DATE","SECTOR")
df <- data.frame(df, value=TRUE)
Then do a reshape:
reshape(df, idvar=c("ID","DATE"), timevar="SECTOR", direction="wide")
The problem with using the reshape function is that the default for missing values is NA (in which case you will have to iterate and replace them with FALSE).
Otherwise you can use cast out of the reshape package (see this question for an example), and set the default to FALSE.
library(reshape) # provides cast()
df.wide <- cast(df, ID + DATE ~ SECTOR, fill=FALSE)
> df.wide
ID DATE 1 2 3
1 1 2 TRUE FALSE FALSE
2 2 3 FALSE TRUE FALSE
3 3 4 FALSE FALSE TRUE
4 4 5 TRUE FALSE FALSE
5 5 6 FALSE TRUE FALSE
6 6 7 FALSE FALSE TRUE
7 7 8 TRUE FALSE FALSE
8 8 9 FALSE TRUE FALSE
9 9 10 FALSE FALSE TRUE
10 10 11 TRUE FALSE FALSE
Here's another approach using xtabs which may or may not be faster (if someone would try and let me know):
df <- data.frame(cbind(1:12, 2:13, 1:3))
colnames(df) <- c("ID","DATE","SECTOR")
foo <- xtabs(~ paste(ID, DATE) + SECTOR, df)
cbind(t(matrix(as.numeric(unlist(strsplit(rownames(foo), " "))), nrow=2)), foo)
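A base R sketch that avoids reshaping altogether is to compare the SECTOR column against each possible value with `outer()`. The frame `src` below is an assumption built from the three rows shown in the question, and the range `1:5` is taken from its example:

```r
# Assumed reconstruction of the question's source frame
src <- data.frame(ID = c(123, 456, 789),
                  DATE = as.Date(c("2008-01-01", "2008-01-01", "2008-01-02")),
                  SECTOR = c(1, 3, 5))

sectors <- 1:5

# Logical matrix: one row per record, one column per sector value
flags <- outer(src$SECTOR, sectors, "==")
colnames(flags) <- paste0("SECTOR.", sectors)

# Attach the indicator columns to the identifying columns
wide <- cbind(src[c("ID", "DATE")], flags)
wide
```

Because every cell is computed by comparison, absent sector values come out as FALSE rather than NA, so no fill step is needed.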
