I have the following data frame:
miniDF1 <- data.frame(Pred = c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B"))
Pred
1 A
2 A
3 B
4 A
5 B
6 B
7 C
8 A
9 B
10 C
11 A
12 A
13 A
14 A
15 B
16 A
17 C
18 B
I am trying to make a new column filled with 1's until "C" is found in Pred, and then fill the new column with 0's until the next "C" is found, and repeat as such until the end of the DF. I have tried the following:
miniDF1 <- miniDF1 %>%
mutate(Outcome = ifelse(str_detect(Pred, "C"), 1, 0)) %>%
fill(Outcome, .direction = 'up')
Pred Outcome
1 A 0
2 A 0
3 B 0
4 A 0
5 B 0
6 B 0
7 C 1
8 A 0
9 B 0
10 C 1
11 A 0
12 A 0
13 A 0
14 A 0
15 B 0
16 A 0
17 C 1
18 B 0
But this only puts 1's in the rows where a "C" is located.
This is the expected result:
miniDF2 <- data.frame(Pred = c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B"),
Outcome = c(1,1,1,1,1,1,1,0,0,0,1,1,1,1,1,1,1,0))
Pred Outcome
1 A 1
2 A 1
3 B 1
4 A 1
5 B 1
6 B 1
7 C 1
8 A 0
9 B 0
10 C 0
11 A 1
12 A 1
13 A 1
14 A 1
15 B 1
16 A 1
17 C 1
18 B 0
I can't figure out how to get the values to flip accordingly, but I thought that that's what the fill(Outcome, .direction = 'up') part of my code was intended to do.
miniDF1$Outcome <- cumsum(c(1, head(miniDF1$Pred == 'C', -1))) %% 2
miniDF1
Pred Outcome
1 A 1
2 A 1
3 B 1
4 A 1
5 B 1
6 B 1
7 C 1
8 A 0
9 B 0
10 C 0
11 A 1
12 A 1
13 A 1
14 A 1
15 B 1
16 A 1
17 C 1
18 B 0
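To see why the one-liner works, here is a sketch that splits it into named steps. The intermediate names (is_c, shifted, counter) are only for illustration and are not part of the original answer.

```r
# Same data as in the question
Pred <- c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B")

is_c    <- Pred == "C"            # TRUE on every row holding "C"
shifted <- c(1, head(is_c, -1))   # push the flags down one row; the first row starts at 1
counter <- cumsum(shifted)        # increases by one on the row after each "C"
Outcome <- counter %% 2           # odd counter -> 1, even counter -> 0

Outcome
```

Shifting the flags by one row is what keeps the "C" row itself in the old block, so the flip only happens on the following row.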
In tidyverse:
library(dplyr)
miniDF1 %>%
mutate(Outcome = cumsum(lag(Pred == 'C', default = TRUE)) %% 2)
This looks cumbersome, but it does what is needed:
i1 <- which(miniDF1$Pred == 'C')
dd <- data.frame(v1 = c(1, 0),v2 = c(i1[1], abs(diff(i1)), nrow(miniDF1)-max(i1)))
rep(dd$v1, dd$v2)
#[1] 1 1 1 1 1 1 1 0 0 0 1 1 1 1 1 1 1 0
Maybe wrap it in a function too:
fun1 <- function(x, val){
  i1 <- which(x == val)
  # use length(x) rather than nrow(miniDF1) so the function works on any vector
  dd <- data.frame(v1 = c(1, 0),
                   v2 = c(i1[1], abs(diff(i1)), length(x) - max(i1)))
  return(rep(dd$v1, dd$v2))
}
miniDF1$outcome <- fun1(miniDF1$Pred, 'C')
# Pred outcome
#1 A 1
#2 A 1
#3 B 1
#4 A 1
#5 B 1
#6 B 1
#7 C 1
#8 A 0
#9 B 0
#10 C 0
#11 A 1
#12 A 1
#13 A 1
#14 A 1
#15 B 1
#16 A 1
#17 C 1
#18 B 0
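One thing to watch: the data.frame() step relies on recycling c(1, 0), which only works when the number of runs is even (an odd number of matches). A possible variant that avoids this, with the hypothetical name alternate_runs, could look like the sketch below. Returning all 1's when there is no match is an assumption about the desired behaviour.

```r
# Hypothetical helper: alternating 1/0 runs that flip after each occurrence of `val`
alternate_runs <- function(x, val) {
  i1 <- which(x == val)
  # Assumption: with no match at all, the outcome never flips
  if (length(i1) == 0) return(rep(1, length(x)))
  # Run lengths: up to the first match, between matches, after the last match
  run_len <- c(i1[1], diff(i1), length(x) - max(i1))
  # rep_len() stretches c(1, 0) to the number of runs, odd or even
  rep(rep_len(c(1, 0), length(run_len)), run_len)
}

Pred <- c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B")
alternate_runs(Pred, "C")

# Two matches give three runs, which the recycling version cannot handle
alternate_runs(c("A", "C", "B", "C", "A"), "C")
```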
You can try rleid from data.table:
miniDF1$Outcome <- ifelse(data.table::rleid(miniDF1$Pred == "C") %% 4 %in% c(1,2), 1, 0)
miniDF1
Pred Outcome
1 A 1
2 A 1
3 B 1
4 A 1
5 B 1
6 B 1
7 C 1
8 A 0
9 B 0
10 C 0
11 A 1
12 A 1
13 A 1
14 A 1
15 B 1
16 A 1
17 C 1
18 B 0
Explanation
For easier comparison, let's try miniDF1$x <- rleid(miniDF1$Pred == "C").
Pred Outcome x
1 A 1 1
2 A 1 1
3 B 1 1
4 A 1 1
5 B 1 1
6 B 1 1
7 C 1 2
8 A 0 3
9 B 0 3
10 C 0 4
11 A 1 5
12 A 1 5
13 A 1 5
14 A 1 5
15 B 1 5
16 A 1 5
17 C 1 6
18 B 0 7
You can see that rleid's result increases by one when a C appears and again on the row after it. The outcome also has to switch between 1 and 0 after each C.
That means x values 1 and 2 need 1, values 3 and 4 need 0, values 5 and 6 need 1, and so on. So I take x modulo 4, which repeats 1, 2, 3, 0: remainders 1 and 2 map to 1, and remainders 3 and 0 map to 0.
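The mapping can be checked without data.table. The sketch below builds the same run ids with base R's rle() as a stand-in for rleid (an assumption made here only to keep the example dependency-free) and prints x next to its remainder.

```r
Pred <- c("A","A","B","A","B","B","C","A","B","C","A","A","A","A","B","A","C","B")

# Base R stand-in for data.table::rleid(): one id per run of equal values
r <- rle(Pred == "C")
x <- rep(seq_along(r$lengths), r$lengths)

# Remainders 1 and 2 map to 1; remainders 3 and 0 map to 0
Outcome <- ifelse((x %% 4) %in% c(1, 2), 1, 0)

data.frame(Pred = Pred, x = x, mod4 = x %% 4, Outcome = Outcome)
```

Note that this mapping assumes the data does not start with "C" and that no two "C" rows are adjacent; in either case the run ids no longer line up with the 1, 2, 3, 0 pattern.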
I have 15 columns and I want to count the values in each column by 0, 1, or NA.
My dataset:
A,B,C,D,E,F,G,H,I,J,K,L,M,N,O
0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,1.0
NA,1.0,0.0,0.0,NA,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0
1.0,1.0,1.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,NA,NA,NA,NA
1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,NA,0.0,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,1.0,0.0,1.0
1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,1.0,1.0,1.0
1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0
0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,0.0,1.0
1.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,1.0,0.0,0.0
1.0,1.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0
1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0,1.0,1.0,0.0,1.0,0.0,1.0
0.0,1.0,1.0,0.0,0.0,0.0,NA,NA,NA,NA,NA,NA,NA,NA,NA
1.0,1.0,1.0,1.0,0.0,0.0,1.0,1.0,1.0,0.0,0.0,0.0,1.0,1.0,0.0
NA,NA,1.0,NA,NA,0.0,1.0,1.0,1.0,1.0,0.0,1.0,1.0,1.0,1.0
0.0,1.0,0.0,0.0,0.0,0.0,0.0,NA,0.0,0.0,NA,NA,NA,NA,NA
I want output to be like:
A B C D E F G H I J K L M N O
0 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
1 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
NA 5 6 2 3 5 0 1 2 3 4 1 2 0 0 1
We can loop through the columns of the dataset and apply table with useNA="always"
sapply(df1, table, useNA="always")
If a column contains only one of the values, say 1, then convert it to a factor with levels specified as 0 and 1
sapply(df1, function(x) table(factor(x, levels = 0:1), useNA = "always"))
# A B C D E F G H I J K L M N O
#0 4 3 8 7 17 15 14 11 14 12 12 10 8 11 9
#1 19 21 17 17 6 9 10 12 8 11 8 10 12 9 11
#<NA> 2 1 0 1 2 1 1 2 3 2 5 5 5 5 5
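Why the factor step matters can be seen on a small made-up data frame (df_demo below is illustrative, not the dataset from the question). A minimal sketch:

```r
# Made-up data: column B contains only 1 and NA, never 0
df_demo <- data.frame(A = c(0, 1, NA, 1), B = c(1, 1, NA, 1))

# Without fixed levels the per-column tables differ in length,
# so sapply() cannot simplify and returns a list
res_list <- sapply(df_demo, table, useNA = "always")

# With levels fixed to 0 and 1 every table has three cells (0, 1, NA),
# so sapply() returns a matrix with one column per variable
res_mat <- sapply(df_demo, function(x) table(factor(x, levels = 0:1), useNA = "always"))
res_mat
```

With the levels fixed, a value that never occurs in a column shows up as a count of 0 instead of being dropped.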
I know how to delete columns in R, but I am not sure how to delete them based on the following set of conditions.
Suppose a data frame such as:
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J= c(3,4,5,6,NA,3,6,4,3,6), K= c(0,1,1,0,NA,1,1,1,1,1),D = c(1,1,1,1,NA,1,1,1,1,1))
DF
L J K D
1 2 3 0 1
2 4 4 1 1
3 5 5 1 1
4 1 6 0 1
5 NA NA NA NA
6 4 3 1 1
7 5 6 1 1
8 6 4 1 1
9 4 3 1 1
10 3 6 1 1
The data frame has to be set up in this fashion. Column K corresponds to column L, and column D corresponds to column J. Because the values in column D are all equal to one, I would like to delete column D and its corresponding column J, yielding a data frame that looks like:
DF
L K
1 2 0
2 4 1
3 5 1
4 1 0
5 NA NA
6 4 1
7 5 1
8 6 1
9 4 1
10 3 1
I know there must be a simple command to do this, but I can't think of one. If it makes any difference, the NA's must be retained.
Additional information: my real data frame has 20 columns in total, 10 like L and J and another 10 like K and D. I need a function that can recognize the correspondence between these two groups and delete columns accordingly.
Thank you in advance!
Okay, assuming the correspondence is based on column number, here is an example:
> n <- 10
>
> # sample data
> d <- data.frame(lapply(1:n, function(x) sample(n)), lapply(1:n, function(x) sample(2, n, TRUE, c(0.1, 0.9)) - 1))
> names(d) <- c(LETTERS[1:n], letters[1:n])
> head(d)
A B C D E F G H I J a b c d e f g h i j
1 5 5 2 7 4 3 4 3 5 8 0 1 1 1 1 1 1 1 1 1
2 9 8 4 6 7 8 8 2 10 5 1 1 1 1 1 1 1 1 1 1
3 6 6 10 3 5 6 2 1 8 6 1 1 1 1 1 1 1 1 1 1
4 1 7 5 5 1 10 10 4 2 4 1 1 1 1 1 1 1 1 1 1
5 10 9 6 2 9 5 6 9 9 9 1 1 0 1 1 1 1 1 1 1
6 2 1 1 4 6 1 5 8 4 10 1 1 1 1 1 1 1 1 1 1
>
> # find the columns that should be kept.
> idx <- which(colMeans(d[(n+1):(2*n)], na.rm = TRUE) != 1)
>
> # filter the data
> d[, c(idx, idx+n)]
A B C D F a b c d f
1 5 5 2 7 3 0 1 1 1 1
2 9 8 4 6 8 1 1 1 1 1
3 6 6 10 3 6 1 1 1 1 1
4 1 7 5 5 10 1 1 1 1 1
5 10 9 6 2 5 1 1 0 1 1
6 2 1 1 4 1 1 1 1 1 1
7 8 4 7 10 2 1 1 1 1 0
8 7 3 9 9 4 1 0 1 0 1
9 3 10 3 1 9 1 1 0 1 1
10 4 2 8 8 7 1 0 1 1 1
I basically agree with koshke (whose SO work is excellent), but would suggest that the test to use is colSums(d[(n+1):(2*n)], na.rm=TRUE) == NROW(d), since a paired 0 and 2 or -1 and 3 could throw off the colMeans test.
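One caveat when comparing against NROW(d): a row of NA's, as in the question's DF, lowers the column sum but not the row count, so an all-ones column would not be detected. A possible sketch that checks the non-missing values directly is shown below. It assumes the layout from the question (value columns first, flag columns second, in matching order), and the names flags, all_one and keep are illustrative.

```r
DF <- data.frame(L = c(2,4,5,1,NA,4,5,6,4,3), J = c(3,4,5,6,NA,3,6,4,3,6),
                 K = c(0,1,1,0,NA,1,1,1,1,1), D = c(1,1,1,1,NA,1,1,1,1,1))

n <- ncol(DF) / 2               # assumes value columns first, flag columns second
flags <- DF[(n + 1):(2 * n)]

# TRUE for flag columns whose non-missing entries are all exactly 1
all_one <- vapply(flags, function(x) all(x == 1, na.rm = TRUE), logical(1))

# Keep the flag columns that are not all ones, plus their paired value columns
keep <- which(!all_one)
result <- DF[, c(keep, keep + n), drop = FALSE]
result
```

Because every entry is compared with 1, a pair such as 0 and 2 cannot pass the test, and the NA row is retained in the result.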