identifying rows having common values in two columns - r

How can I identify rows whose values in two columns (here: treatment, replicate) also occur together in at least one other row?
set.seed(0)
x <- rep(1:10, 4)
y <- sample(c(rep(1:10, 2)+rnorm(20)/5, rep(6:15, 2) + rnorm(20)/5))
treatment <- sample(gl(8, 5, 40, labels=letters[1:8]))
replicate <- sample(gl(8, 5, 40))
d <- data.frame(x=x, y=y, treatment=treatment, replicate=replicate)
table(d$treatment, d$replicate)
# 1 2 3 4 5 6 7 8
# a 1 0 0 1 1 2 0 0
# b 1 1 0 0 1 2 0 0
# c 0 0 0 0 2 0 1 2
# d 2 0 1 1 0 0 1 0
# e 0 2 1 1 0 0 0 1
# f 0 1 1 0 1 1 1 0
# g 0 1 0 2 0 0 1 1
# h 1 0 2 0 0 0 1 1
From the above output, my guess is that the output should contain 16 rows. Any idea how to achieve this?
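The guess of 16 can be checked from the contingency table itself: every cell with a count above 1 contributes that many rows. A minimal sketch, with the counts typed in by hand from the output above (the names tab and n_shared are only for illustration):

```r
# Counts copied from the table(d$treatment, d$replicate) output above
tab <- matrix(c(
  1, 0, 0, 1, 1, 2, 0, 0,
  1, 1, 0, 0, 1, 2, 0, 0,
  0, 0, 0, 0, 2, 0, 1, 2,
  2, 0, 1, 1, 0, 0, 1, 0,
  0, 2, 1, 1, 0, 0, 0, 1,
  0, 1, 1, 0, 1, 1, 1, 0,
  0, 1, 0, 2, 0, 0, 1, 1,
  1, 0, 2, 0, 0, 0, 1, 1
), nrow = 8, byrow = TRUE,
dimnames = list(letters[1:8], as.character(1:8)))

# Rows that share their (treatment, replicate) pair with at least one other row
n_shared <- sum(tab[tab > 1])
n_shared
```

With the live objects, tab could be taken directly from table(d$treatment, d$replicate) instead of being typed in.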
Update:
library(dplyr)
d %>% group_by(treatment, replicate) %>% filter(n() > 1)
# A tibble: 16 x 4
x y treatment replicate
<int> <dbl> <fctr> <fctr>
1 2 7.050445 h 3
2 5 1.840198 b 6
3 8 9.160838 d 1
4 9 4.254486 h 3
5 2 8.870106 g 4
6 4 7.821616 a 6
7 6 9.752492 e 2
8 7 9.988579 c 5
9 9 10.480931 c 8
10 1 2.770469 c 8
11 2 7.913338 e 2
12 3 13.743080 d 1
13 9 5.692010 b 6
14 10 11.100722 a 6
15 3 12.198432 g 4
16 5 5.955146 c 5
I have found one approach whose results seem to satisfy the condition. Are there any better solutions?

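You can use duplicated as a condition. On its own, duplicated() flags only the second and later occurrences of a combination, so a second pass with fromLast = TRUE is needed to flag the first occurrence as well: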
dups <- d[which(duplicated(d[,c("treatment", "replicate")]) |
duplicated(d[ ,c("treatment", "replicate")], fromLast = TRUE)),]
>dups
x y treatment replicate
2 2 7.050445 h 3
5 5 1.840198 b 6
8 8 9.160838 d 1
9 9 4.254486 h 3
12 2 8.870106 g 4
14 4 7.821616 a 6
16 6 9.752492 e 2
17 7 9.988579 c 5
19 9 10.480931 c 8
21 1 2.770469 c 8
22 2 7.913338 e 2
23 3 13.743080 d 1
29 9 5.692010 b 6
30 10 11.100722 a 6
33 3 12.198432 g 4
35 5 5.955146 c 5
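Another base R option is to compute each row's group size with ave() and keep the rows whose group has more than one member. This mirrors the group_by/filter(n() > 1) logic without extra packages. The sketch below uses a small made-up data frame, not the question's data:

```r
# Toy data: rows 1 and 3 share the pair (a, 1); all other pairs are unique
d <- data.frame(
  y = c(1.2, 3.4, 5.6, 7.8, 9.0),
  treatment = c("a", "b", "a", "c", "b"),
  replicate = c(1, 2, 1, 3, 3)
)

# Size of each (treatment, replicate) group, repeated for every row
group_size <- ave(seq_len(nrow(d)), d$treatment, d$replicate, FUN = length)

# Keep rows whose pair occurs more than once
dups <- d[group_size > 1, ]
dups
```

The same group_size vector can also be stored as a column if the count itself is of interest.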

Related

Fill Column with 1 Until Value Found, Then Repeat Fill with 0 Until Value Found Again

I have the following data frame:
miniDF1 <- data.frame(Pred = c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B"))
Pred
1 A
2 A
3 B
4 A
5 B
6 B
7 C
8 A
9 B
10 C
11 A
12 A
13 A
14 A
15 B
16 A
17 C
18 B
I am trying to make a new column filled with 1's until "C" is found in Pred, and then fill the new column with 0's until the next "C" is found, and repeat as such until the end of the DF. I have tried the following:
miniDF1 <- miniDF1 %>%
mutate(Outcome = ifelse(str_detect(Pred, "C"), 1, 0)) %>%
fill(Outcome, .direction = 'up')
Pred Outcome
1 A 0
2 A 0
3 B 0
4 A 0
5 B 0
6 B 0
7 C 1
8 A 0
9 B 0
10 C 1
11 A 0
12 A 0
13 A 0
14 A 0
15 B 0
16 A 0
17 C 1
18 B 0
but this only puts 1's in the rows where a "C" is located.
This is the expected result:
miniDF2 <- data.frame(Pred = c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B"),
Outcome = c(1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,0))
Pred Outcome
1 A 1
2 A 1
3 B 1
4 A 1
5 B 1
6 B 1
7 C 1
8 A 0
9 B 0
10 C 0
11 A 1
12 A 1
13 A 1
14 A 1
15 B 1
16 A 1
17 C 1
18 B 0
I can't figure out how to get the values to flip accordingly, but I thought that that's what the fill(Outcome, .direction = 'up') part of my code was intended to do.
miniDF1$Outcome <- cumsum(c(1, head(miniDF1$Pred == 'C', -1))) %% 2
miniDF1
Pred Outcome
1 A 1
2 A 1
3 B 1
4 A 1
5 B 1
6 B 1
7 C 1
8 A 0
9 B 0
10 C 0
11 A 1
12 A 1
13 A 1
14 A 1
15 B 1
16 A 1
17 C 1
18 B 0
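To see why the cumsum one-liner works, it can help to spell out the intermediate vectors. A sketch on the same Pred values; the names is_c, flip and block are only for illustration:

```r
Pred <- c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B")

# TRUE where the value is "C"
is_c <- Pred == "C"

# Shift by one position so the switch happens on the row AFTER each "C";
# the leading 1 starts the first block
flip <- c(1, head(is_c, -1))

# Running block number: 1 for rows 1-7, 2 for rows 8-10, 3 for rows 11-17, 4 for row 18
block <- cumsum(flip)

# Odd blocks become 1, even blocks become 0
Outcome <- block %% 2
Outcome
```

Because the shift is done per row, this also handles adjacent "C" values, where each one starts a new block.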
In the tidyverse:
library(dplyr)
miniDF1 %>%
mutate(Outcome = cumsum(lag(Pred == 'C', default = TRUE)) %% 2)
This looks cumbersome but does what is needed:
i1 <- which(miniDF1$Pred == 'C')
dd <- data.frame(v1 = c(1, 0),v2 = c(i1[1], abs(diff(i1)), nrow(miniDF1)-max(i1)))
rep(dd$v1, dd$v2)
#[1] 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0
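Maybe wrap it in a function too:
fun1 <- function(x, val){
  i1 <- which(x == val)
  # run lengths: up to the first match, between matches, after the last match
  v2 <- c(i1[1], diff(i1), length(x) - max(i1))
  # alternate 1, 0, 1, ... across the runs, however many there are
  return(rep(rep_len(c(1, 0), length(v2)), v2))
}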
miniDF1$outcome <- fun1(miniDF1$Pred, 'C')
# Pred outcome
#1 A 1
#2 A 1
#3 B 1
#4 A 1
#5 B 1
#6 B 1
#7 C 1
#8 A 0
#9 B 0
#10 C 0
#11 A 1
#12 A 1
#13 A 1
#14 A 1
#15 B 1
#16 A 1
#17 C 1
#18 B 0
You may try using rleid, like this:
miniDF1$Outcome <- ifelse(data.table::rleid(miniDF1$Pred == "C") %% 4 %in% c(1,2), 1, 0)
miniDF1
Pred Outcome
1 A 1
2 A 1
3 B 1
4 A 1
5 B 1
6 B 1
7 C 1
8 A 0
9 B 0
10 C 0
11 A 1
12 A 1
13 A 1
14 A 1
15 B 1
16 A 1
17 C 1
18 B 0
Explanation
For easier comparison, let's try miniDF1$x <- data.table::rleid(miniDF1$Pred == "C").
Pred Outcome x
1 A 1 1
2 A 1 1
3 B 1 1
4 A 1 1
5 B 1 1
6 B 1 1
7 C 1 2
8 A 0 3
9 B 0 3
10 C 0 4
11 A 1 5
12 A 1 5
13 A 1 5
14 A 1 5
15 B 1 5
16 A 1 5
17 C 1 6
18 B 0 7
You can see that rleid's result increases by one at each C and again on the row after it. Since Outcome has to switch between 1 and 0 on the row after each C, x values 1 and 2 should map to 1, values 3 and 4 to 0, values 5 and 6 to 1, and so on.
So I take x modulo 4: remainders 1 and 2 give Outcome 1, and remainders 3 and 0 give Outcome 0.
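The modulo step can be checked on its own with plain run ids, without data.table. A small sketch, where x stands in for the rleid result:

```r
# Stand-in for consecutive run ids as produced by rleid
x <- 1:8

# Remainders cycle through 1, 2, 3, 0
remainder <- x %% 4

# Remainders 1 and 2 map to 1; remainders 3 and 0 map to 0
outcome <- as.integer(remainder %in% c(1, 2))
outcome
```

Note that this mapping relies on the column not starting with "C" and on "C" values never being adjacent, since adjacent "C" rows fall into a single run and get the same id.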

Determining how many times each distinct value (0, 1 or NA) occurs in each column of a data frame in R

I have 15 columns and, for each column, I want to count how many values are 0, 1 or NA.
My dataset:
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
NA,1.0,0.0,0.0,NA,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,NA,NA,NA,NA
1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,NA,0.0,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
NA,NA,1.0,NA,NA,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
0.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,NA,NA,NA,NA,NA
I want output to be like:
A B C D E F G H I J K L M N O
0 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
1 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
NA 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
We can loop over the columns of the dataset and apply table with useNA="always":
sapply(df1, table, useNA="always")
If a column contains only one of the values, say 1, the per-column tables will not line up. In that case, convert each column to a factor with the levels specified as 0 and 1:
sapply(df1, function(x) table(factor(x, levels = 0:1), useNA = "always"))
# A B C D E F G H I J K L M N O
#0 4 3 8 7 17 15 14 11 14 12 12 10 8 11 9
#1 19 21 17 17 6 9 10 12 8 11 8 10 12 9 11
#<NA> 2 1 0 1 2 1 1 2 3 2 5 5 5 5 5
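If only these three counts are needed, a sketch with colSums() gives the same layout without building a table per column. The small df1 below is made up for illustration; it is not the question's dataset:

```r
# Toy data with the same kind of values (0, 1, NA)
df1 <- data.frame(A = c(0, 1, NA, 1), B = c(1, 1, 1, NA))

# One row of counts per value; NA comparisons are dropped with na.rm = TRUE
counts <- rbind(
  `0`  = colSums(df1 == 0, na.rm = TRUE),
  `1`  = colSums(df1 == 1, na.rm = TRUE),
  `NA` = colSums(is.na(df1))
)
counts
```

Each column of counts should add up to nrow(df1), which is a quick sanity check that no other values are present.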

Fill in missing rows in R

Suppose I have a data frame which looks like this
ID A B C D Month
1 X M 5 1 3
1 X K 4 2 4
1 X K 3 7 5
1 X K 2 6 6
2 Y L 5 8 1
2 Y L 2 3 2
2 Y M 5 1 3
2 Y K 2 7 5
2 Y M 2 8 6
3 Z K 5 3 1
3 Z M 6 3 2
3 Z M 5 8 3
3 Z K 4 2 4
In this data, ID and A identify each group and stay constant within it, while B, C, D and Month can change.
Month is a factor with 6 levels, 1 to 6.
B is a factor with 3 levels: K, L and M.
C and D can take any value.
I want this data to become like this
ID A B C D Month
1 X 0 0 0 1
1 X 0 0 0 2
1 X M 5 1 3
1 X K 4 2 4
1 X K 3 7 5
1 X K 2 6 6
2 Y L 5 8 1
2 Y L 2 3 2
2 Y M 5 1 3
2 Y 0 0 0 4
2 Y K 2 7 5
2 Y M 2 8 6
3 Z K 5 3 1
3 Z M 6 3 2
3 Z M 5 8 3
3 Z K 4 2 4
3 Z 0 0 0 5
3 Z 0 0 0 6
It should fill in the missing rows by keeping the unique variables values same and filling in the varying ones with zero.
I can use the zoo library to fill in missing values, but how do I fill in rows that are missing entirely?
Maybe something like this would work for your needs:
library(dplyr)
mydf %>%
full_join(expand.grid(ID = unique(mydf$ID), Month = 1:6)) %>%
group_by(ID) %>%
mutate(A = replace(A, is.na(A), unique(na.omit(A)))) %>%
arrange(ID, A, Month) %>%
replace(., is.na(.), 0)
# Joining by: c("ID", "Month")
# Source: local data frame [18 x 6]
# Groups: ID
#
# ID A B C D Month
# 1 1 X 0 0 0 1
# 2 1 X 0 0 0 2
# 3 1 X M 5 1 3
# 4 1 X K 4 2 4
# 5 1 X K 3 7 5
# 6 1 X K 2 6 6
# 7 2 Y L 5 8 1
# 8 2 Y L 2 3 2
# 9 2 Y M 5 1 3
# 10 2 Y 0 0 0 4
# 11 2 Y K 2 7 5
# 12 2 Y M 2 8 6
# 13 3 Z K 5 3 1
# 14 3 Z M 6 3 2
# 15 3 Z M 5 8 3
# 16 3 Z K 4 2 4
# 17 3 Z 0 0 0 5
# 18 3 Z 0 0 0 6
Here's a way using base R
frame <- expand.grid(ID = unique(dat$ID), Month = 1:6)
dat2 <- merge(dat, frame, by=c("ID", "Month"), all=TRUE)[, union(names(dat), names(frame))]
levels(dat2$B) <- c(levels(dat2$B), 0)
res <- lapply(split(dat2, dat2$ID), function(x) {
x$A[which(is.na(x$A))] <- unique(x$A)[!is.na(unique(x$A))]
x[is.na(x)] <- 0
x
})
do.call(rbind, res)
ID A B C D Month
1.1 1 X 0 0 0 1
1.2 1 X 0 0 0 2
1.3 1 X M 5 1 3
1.4 1 X K 4 2 4
1.5 1 X K 3 7 5
1.6 1 X K 2 6 6
2.7 2 Y L 5 8 1
2.8 2 Y L 2 3 2
2.9 2 Y M 5 1 3
2.10 2 Y 0 0 0 4
2.11 2 Y K 2 7 5
2.12 2 Y M 2 8 6
3.13 3 Z K 5 3 1
3.14 3 Z M 6 3 2
3.15 3 Z M 5 8 3
3.16 3 Z K 4 2 4
3.17 3 Z 0 0 0 5
3.18 3 Z 0 0 0 6
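A variation on the base R idea, sketched on a reduced made-up data frame (3 months, fewer columns): putting A into the key grid together with ID means it never becomes NA, so no per-group repair step is needed. It assumes character columns rather than factors, so the 0 filler needs no extra factor level.

```r
# Reduced toy version of the data: ID 1 lacks month 1, ID 2 lacks months 2 and 3
dat <- data.frame(
  ID = c(1L, 1L, 2L),
  A = c("X", "X", "Y"),
  B = c("M", "K", "L"),
  C = c(5, 4, 5),
  Month = c(2L, 3L, 1L),
  stringsAsFactors = FALSE
)

# Every (ID, A) pair crossed with every month; merge() with no shared
# columns returns the Cartesian product
keys <- unique(dat[c("ID", "A")])
frame <- merge(keys, data.frame(Month = 1:3))

# Left join the data onto the full grid, then fill the gaps with 0
full <- merge(frame, dat, by = c("ID", "A", "Month"), all.x = TRUE)
full[is.na(full)] <- 0
full <- full[order(full$ID, full$Month), c("ID", "A", "B", "C", "Month")]
full
```

For the question's data the month range would be 1:6 and D would be carried along in the same way.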

Insert new columns based on the union of colnames of two data frames

I want to write an R function that inserts zero-filled columns into an existing data.frame. Here is an example:
Data.frame 1
A B C D
1 1 3 4 5
2 4 5 6 7
3 4 5 6 2
4 4 55 2 3
Data.frame 2
A B E X
11 5 1 5 5
22 44 55 9 6
33 12 4 2 4
44 9 7 4 2
Based on the union of the two sets of column names (that is, A, B, C, D, E, X), I want to update the two data frames like this:
Data.frame 1 (new)
A B C D E X
1 1 3 4 5 0 0
2 4 5 6 7 0 0
3 4 5 6 2 0 0
4 4 55 2 3 0 0
Data.frame 2 (new)
A B C D E X
11 5 1 0 0 5 5
22 44 55 0 0 9 6
33 12 4 0 0 2 4
44 9 7 0 0 4 2
Thanks in advance.
Option 1 (thanks @Jilber for the edits)
I'm assuming the order of the columns doesn't matter:
df2part <- subset(df2,select = setdiff(colnames(df2),colnames(df1)))*0
df1f <- cbind(df1,df2part)
df1part <- subset(df1,select = setdiff(colnames(df1),colnames(df2)))*0
df2f <- cbind(df2,df1part)
If the order really matters, then just reorder the columns
df2f <- df2f[, sort(names(df2f))]
Output
> df1f
A B C D E X
1 1 3 4 5 0 0
2 4 5 6 7 0 0
3 4 5 6 2 0 0
4 4 55 2 3 0 0
> df2f
A B C D E X
11 5 1 0 0 5 5
22 44 55 0 0 9 6
33 12 4 0 0 2 4
44 9 7 0 0 4 2
Option 2 -
library(data.table)
df1 <- data.table(df1)
df2 <- data.table(df2)
df1names <- colnames(df1)
df2names <- colnames(df2)
df1[,setdiff(df2names,df1names) := 0]
df2[,setdiff(df1names,df2names) := 0]
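Since the question asks for a function, the same idea can be wrapped in a small base R helper. This is only a sketch: pad_cols is a hypothetical name, and the two data frames are shortened versions of the example.

```r
df1 <- data.frame(A = c(1, 4), B = c(3, 5), C = c(4, 6))
df2 <- data.frame(A = c(5, 44), B = c(1, 55), E = c(5, 9))

# Hypothetical helper: add any column in `cols` that `df` lacks, filled with 0,
# and return the columns in the order given by `cols`
pad_cols <- function(df, cols) {
  to_add <- setdiff(cols, names(df))
  if (length(to_add) > 0) {
    df[to_add] <- 0
  }
  df[cols]
}

all_cols <- union(names(df1), names(df2))
df1f <- pad_cols(df1, all_cols)
df2f <- pad_cols(df2, all_cols)
df1f
df2f
```

Because both results are indexed by the same all_cols vector, they come back with identical column order, so a later rbind(df1f, df2f) lines up without sorting.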

Conditionally delete columns in R

I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because the values in column D are all equal to one, I would like to delete column D and the corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there has got to be a simple command to do so, I just can't think of any. And if it makes any difference, the NA's must be retained.
Additional information: my real data frame has a total of 20 columns, 10 like L and J and another 10 like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly.
Thank you in advance!
Okay, assuming the correspondence is based on column position, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x)sample(n)), lapply(1:n, function(x)sample(2, n, T, c(0.1, 0.9))-1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the column that should be left.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d), since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
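Both the colMeans and the colSums tests compare an aggregate, so a column whose values merely average to 1 would still pass, and colSums(..., na.rm = TRUE) == NROW(d) no longer matches once a column contains NA, as in the question's data. A sketch that tests each value directly instead, using the question's DF (the vectors value_cols and flag_cols are assumed names, paired by position):

```r
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J = c(3,4,5,6,NA,3,6,4,3,6),
                 K = c(0,1,1,0,NA,1,1,1,1,1), D = c(1,1,1,1,NA,1,1,1,1,1))

# flag_cols[i] is the indicator column that belongs to value_cols[i]
value_cols <- c("L", "J")
flag_cols  <- c("K", "D")

# TRUE for indicator columns whose non-missing values are all exactly 1
all_ones <- sapply(DF[flag_cols], function(x) all(x == 1, na.rm = TRUE))

# Keep only the pairs whose indicator column is not all ones; NA rows are retained
result <- DF[c(value_cols[!all_ones], flag_cols[!all_ones])]
result
```

One edge case to be aware of: an indicator column that is entirely NA also counts as all ones here, because all() of an empty set is TRUE.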

Resources