All the possible outcomes - r

My data frame looks like this
gen<-c("A","B","C")
prob<-c(0.95,0.82,0.78)
mw<-c(10,20,50)
df<-data.frame(gen,prob,mw)
gen prob mw
1 A 0.95 10
2 B 0.82 20
3 C 0.78 50
Now I want all the possible outcomes, (A,B,C), (A,B), (A,C), (B,C), (A), (B), (C) and (NONE), together with their probabilities. For example, (A,B,C) = 0.95*0.82*0.78 = 0.60762.
trials <- data.frame(matrix(nrow=0,ncol=length(gen)))
for(i in 1:length(gen)){
trials.tmp <- t(combn(gen,i))
trials <- rbind(trials,cbind(trials.tmp, matrix(nrow=nrow(trials.tmp),
ncol=length(gen)-i ) ))
}
trials
V1 V2 V3
1 A <NA> <NA>
2 B <NA> <NA>
3 C <NA> <NA>
4 A B <NA>
5 A C <NA>
6 B C <NA>
7 A B C
But I'm still missing the combination (NA,NA,NA). How can I make a new data frame with all outcomes and their probabilities?

You can try combn, as mentioned by zx8754. For example:
a=0.95
b=0.82
c=0.78
x <- c(a,b,c)
df <- rbind(t(combn(x, 3)), cbind(t(combn(x, 2)), NA), cbind(t(combn(x, 1)), NA, NA))
apply(df, 1, function(x) prod(x[!is.na(x)]))
[1] 0.60762 0.77900 0.74100 0.63960 0.95000 0.82000 0.78000
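The combn() call above covers the non-empty subsets but not the (NONE) case. A minimal sketch, assuming the three generators are independent, enumerates all 2^3 in/out patterns with expand.grid(). The column names p_included and p_exact are illustrative and not from the question. p_included multiplies only the probabilities of the generators in the subset (the convention used above, with the empty product equal to 1 for NONE), while p_exact also multiplies by 1 - prob for every generator left out, which is an assumption about what "outcome" should mean here.

```r
gen  <- c("A", "B", "C")
prob <- c(0.95, 0.82, 0.78)

# One row per outcome: TRUE means the generator is part of the outcome
states <- expand.grid(rep(list(c(TRUE, FALSE)), length(gen)))
names(states) <- gen

# Readable label for each row, "NONE" when nothing is included
states$outcome <- apply(states[gen], 1, function(s) {
  if (any(s)) paste(gen[s], collapse = ",") else "NONE"
})

# Product over the included generators only (prod(numeric(0)) is 1)
states$p_included <- apply(states[gen], 1, function(s) prod(prob[s]))

# Assumption: independence, so excluded generators contribute 1 - prob
states$p_exact <- apply(states[gen], 1, function(s) {
  prod(ifelse(s, prob, 1 - prob))
})

states
```

Under the independence assumption the eight p_exact values sum to 1, which is a quick sanity check on the enumeration.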


Convert a list with inconsistent naming to a data frame, with variable depth

Consider the following list:
x <- list("a" = list("b", "c"),
"d" = list("e", "f" = list("g", "h")),
"i" = list("j", "k" = list("l" = list("m", "n" = list("o", "p")))))
It is worth noting that:
Not all names and elements are going to be of one character
There is an undetermined level of nesting a priori.
Given x, my aim is to output the data frame:
y <- data.frame(
main_level = c(rep("a", 2), rep("d", 3), rep("i", 4)),
level1 = c("b", "c", "e", rep("f", 2), "j", rep("k", 3)),
level2 = c(NA, NA, NA, "g", "h", NA, "l", "l", "l"),
level3 = c(NA, NA, NA, NA, NA, NA, "m", "n", "n"),
level4 = c(NA, NA, NA, NA, NA, NA, NA, "o", "p")
)
> y
main_level level1 level2 level3 level4
1 a b <NA> <NA> <NA>
2 a c <NA> <NA> <NA>
3 d e <NA> <NA> <NA>
4 d f g <NA> <NA>
5 d f h <NA> <NA>
6 i j <NA> <NA> <NA>
7 i k l m <NA>
8 i k l n o
9 i k l n p
NOTE that a typo was corrected in y above.
The above implies that there will be a variable number of columns as well, depending on the depth of the nesting.
Solutions online that I've found, when it comes to nested lists, assume that the list naming structure is more or less consistent, which is of course not the case here; or that the list depth is identical. For instance, the solutions at How to convert a nested lists to dataframe in R? and Converting nested list to dataframe do not apply because they are much more consistent in their naming.
Here's a way mainly relying on rrapply:
rrapply::rrapply(x, how = "melt") |>
apply(1, function(row){
newrow <- row[grep("[A-Za-z]", row)]
length(newrow) <- purrr::vec_depth(x) - 1
newrow
}) |>
t() |> as.data.frame() |>
`colnames<-`(c("main_level", paste0("level", 1:4)))
output
main_level level1 level2 level3 level4
1 a b <NA> <NA> <NA>
2 a c <NA> <NA> <NA>
3 d e <NA> <NA> <NA>
4 d f g <NA> <NA>
5 d f h <NA> <NA>
6 i j <NA> <NA> <NA>
7 i k l m <NA>
8 i k l n o
9 i k l n p
Note that so far it is quite crude. There might be a better way to reshape the output of rrapply. For instance, row[grep("[A-Za-z]", row)] may not work every time. I have also not tested whether length(newrow) <- purrr::vec_depth(x) - 1 is a good way of guessing the length, but it works here.
Here is a recursive function that has no assumptions other than the structure you described:
list_to_df <- function(l) {
leaves <- list()
go_deeper <- function(l, index=1, path=NULL) {
# we can still go deeper
if (is.list(l[[index]])) {
path <- c(path, names(l)[index])
l <- l[[index]]
lapply(seq_along(l), function(i) go_deeper(l, i, path))
# this is the final node (leaf)
} else {
leaves <<- c(leaves, list(c(path, l[[index]])))
}
}
# this saves the paths to each last node (leaf) in 'leaves' as a side effect
go_deeper(list(l))
# now just make a data frame from the 'leaves' list
len.max <- max(lengths(leaves))
leaves <- sapply(leaves, function(x) c(x, rep(NA, len.max-length(x))))
leaves <- as.data.frame(t(leaves))
names(leaves) <- c('main_level', paste0('level', seq_len(ncol(leaves)-1)))
leaves
}
list_to_df(x)
# main_level level1 level2 level3 level4
# 1 a b <NA> <NA> <NA>
# 2 a c <NA> <NA> <NA>
# 3 d e <NA> <NA> <NA>
# 4 d f g <NA> <NA>
# 5 d f h <NA> <NA>
# 6 i j <NA> <NA> <NA>
# 7 i k l m <NA>
# 8 i k l n o
# 9 i k l n p
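For comparison, the same recursion can be sketched without the `<<-` side effect by letting each call return the paths it found. This is only an illustration: leaf_paths is a hypothetical helper name, and the sketch assumes, as in the question, that every nested list is named and every unnamed element is a leaf.

```r
x <- list("a" = list("b", "c"),
          "d" = list("e", "f" = list("g", "h")),
          "i" = list("j", "k" = list("l" = list("m", "n" = list("o", "p")))))

# Return a list of character vectors, one root-to-leaf path per leaf
leaf_paths <- function(node, path = character()) {
  if (!is.list(node)) return(list(c(path, node)))
  nms <- names(node)
  if (is.null(nms)) nms <- rep("", length(node))
  out <- list()
  for (i in seq_along(node)) {
    # only nested lists contribute their name to the path
    child_path <- if (is.list(node[[i]])) c(path, nms[i]) else path
    out <- c(out, leaf_paths(node[[i]], child_path))
  }
  out
}

paths <- leaf_paths(x)
width <- max(lengths(paths))

# Pad every path with NA to the same width and bind the rows
mat <- t(vapply(paths,
                function(p) c(p, rep(NA_character_, width - length(p))),
                character(width)))
res <- as.data.frame(mat)
names(res) <- c("main_level", paste0("level", seq_len(width - 1)))
res
```

The number of level columns follows from the longest path, so nothing about the depth is hard-coded.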
Here is a slightly more robust approach using rrapply() inspired by Maël's answer:
1) Set all missing/empty names in the nested list to NA
2) Melt to a long data.frame with the node paths as data.frame rows
3) Replace the first NA on each node path by its leaf value (value column)
library(rrapply)
x1 <- rrapply(x, f = \(x, .xname) ifelse(grepl("^\\d*$", .xname), NA, .xname), how = "names")
x2 <- rrapply(x1, how = "melt")
x3 <- apply(x2, 1, \(x){ x[is.na(x)][1] <- x[["value"]]; x })
as.data.frame(t(x3[-nrow(x3), ]))
#> L1 L2 L3 L4 L5
#> 1 a b <NA> <NA> <NA>
#> 2 a c <NA> <NA> <NA>
#> 3 d e <NA> <NA> <NA>
#> 4 d f g <NA> <NA>
#> 5 d f h <NA> <NA>
#> 6 i j <NA> <NA> <NA>
#> 7 i k l m <NA>
#> 8 i k l n o
#> 9 i k l n p

Split variable from comma into an ordered dataframe

I have a dataframe like this, where the values are separated by comma.
# Events
# A,B,C
# C,D
# B,A
# D,B,A,E
# A,E,B
I would like to get the following data frame
# Event1 Event2 Event3 Event4 Event5
# A B C NA NA
# NA NA C NA NA
# A B NA NA NA
# A B NA D E
# A B NA NA E
I have tried with cSplit but I don't get the desired df. Is this possible?
NOTE: The values don't appear in the same position as the Event variable in the second data frame.
1) Here is a base R solution. Split each row, giving the list s, and create cols, which contains the sorted possible values. Then iterate over s, keeping each value of cols that occurs in that row and NA otherwise, and convert the result to a data frame.
Note that this does not hard code the column names and continues to work even if some column names are substrings of other column names.
s <- strsplit(DF$Events, ",")
cols <- unique(sort(unlist(s)))
data.frame(Event = t(sapply(s, function(x) ifelse(cols %in% x, cols, NA))))
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
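To illustrate the remark about substrings, here is a small sketch on made-up data (the vector Events below is hypothetical and deliberately contains the overlapping names "A" and "AB"). It compares the exact membership test used in 1) with substring matching on the raw strings.

```r
Events <- c("A,AB", "AB", "B,A")   # made-up data with overlapping names
s <- strsplit(Events, ",")
cols <- sort(unique(unlist(s)))    # the three distinct values

# exact membership, as in solution 1)
exact <- t(sapply(s, function(x) ifelse(cols %in% x, cols, NA)))

# substring matching on the raw strings, for comparison
partial <- t(sapply(Events, function(e) {
  hit <- vapply(cols, function(p) grepl(p, e, fixed = TRUE), logical(1))
  ifelse(hit, cols, NA)
}, USE.NAMES = FALSE))

exact[2, ]    # only "AB" is kept for the second row
partial[2, ]  # "A" and "B" are also reported, although the row is just "AB"
```

This is why splitting first and testing with %in% is the safer choice when names can overlap.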
2) This base R solution uses strsplit as above and then names the components since stack requires a named list and then invokes stack. Then we expand that into a wide form using tapply and convert it to a data frame and fix up the names.
s <- strsplit(DF$Events, ",")
names(s) <- seq_along(s)
stk <- stack(s)
mat <- t(tapply(stk$values, stk, c))
colnames(mat) <- NULL
data.frame(Event = mat)
giving:
Event.1 Event.2 Event.3 Event.4 Event.5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E
This could also be represented as an R 4.2+ pipeline:
DF |>
with(setNames(Events, seq_along(Events))) |>
strsplit(",") |>
stack() |>
with(tapply(values, data.frame(ind, values), c)) |>
`colnames<-`(NULL) |>
data.frame(Event = _)
Note
The input in reproducible form:
Lines <- "Events
A,B,C
C,D
B,A
D,B,A,E
A,E,B"
DF <- read.table(text = Lines, header = TRUE, strip.white = TRUE)
Another approach using tidyverse:
library(dplyr)
library(purrr)
library(stringr)
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
letters <- Events %>% str_split(",") %>% unlist() %>% unique()
df <- data.frame(Events)
df %>%
map2_dfc(.y = letters, ~ ifelse(str_detect(.x, .y), .y, NA)) %>%
set_names(nm = paste0("Events", 1:length(letters)))
#> # A tibble: 5 × 5
#> Events1 Events2 Events3 Events4 Events5
#> <chr> <chr> <chr> <chr> <chr>
#> 1 A B C <NA> <NA>
#> 2 <NA> <NA> C D <NA>
#> 3 A B <NA> <NA> <NA>
#> 4 A B <NA> D E
#> 5 A B <NA> <NA> E
Created on 2022-07-11 by the reprex package (v2.0.1)
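This tidyverse solution is easily the most economical in terms of amount of code used, although it keeps the values in their original order rather than aligning each letter to its own column: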
library(tidyverse)
data.frame(Events) %>%
# split the strings by the comma:
mutate(Events = str_split(Events, ",")) %>%
# unnest splitted values wider into columns:
unnest_wider(Events, names_sep = "")
# A tibble: 5 × 4
Events1 Events2 Events3 Events4
<chr> <chr> <chr> <chr>
1 A B C NA
2 C D NA NA
3 B A NA NA
4 D B A E
5 A E B NA
Data:
Events = c("A,B,C", 'C,D', "B,A", "D,B,A,E", "A,E,B")
We can try the following base R code
> d <- t(table(stack(setNames(strsplit(df$Events, ","), 1:nrow(df)))))
> as.data.frame.matrix(`dim<-`(colnames(d)[ifelse(d > 0, d * col(d), NA)], dim(d)))
V1 V2 V3 V4 V5
1 A B C <NA> <NA>
2 <NA> <NA> C D <NA>
3 A B <NA> <NA> <NA>
4 A B <NA> D E
5 A B <NA> <NA> E

If else logic for single column in data frame then apply to all columns

In a data frame, how can I set all the other columns of a row to NULL/NA when a certain condition holds for a single column?
For example, I have this data frame:
A<- c(2,3,5,6,5,7,8,5)
B <- c("AB", "BC", "CD", "DE", "EF", "FG", "HI", "IJ")
C <- c("X", "Y", "Z", "W", "X", "Y", "Z", "W")
ABC <-data.frame(A,B,C)
> ABC
A B C
1 2 AB X
2 3 BC Y
3 5 CD Z
4 6 DE W
5 5 EF X
6 7 FG Y
7 8 HI Z
8 5 IJ W
For every row where ABC$A equals 5, I want all the other values in that row to be NULL or NA.
My desired output should look like this
> ABC
A B C
1 2 AB X
2 3 BC Y
3 5 <NA> <NA>
4 6 DE W
5 5 <NA> <NA>
6 7 FG Y
7 8 HI Z
8 5 <NA> <NA>
The ifelse function could make this work, but what if I had a lot of columns? I want it to apply to all columns besides A.
You could use row/column subsetting and assign NAs to all columns directly.
ABC[ABC$A == 5, -1] <- NA
ABC
# A B C
#1 2 AB X
#2 3 BC Y
#3 5 <NA> <NA>
#4 6 DE W
#5 5 <NA> <NA>
#6 7 FG Y
#7 8 HI Z
#8 5 <NA> <NA>
Here -1 is to ignore first column A since we do not want to change any values in that column.
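A variation on the same idea, as a sketch only: select the columns by name rather than by position, and wrap the condition in which() so that missing values in A do not end up in the row index (as far as I know, data frame assignment rejects NA in a logical row subscript). The NA in A below is added purely for illustration and is not in the question's data.

```r
ABC <- data.frame(A = c(2, 3, 5, NA, 5),
                  B = c("AB", "BC", "CD", "DE", "EF"),
                  C = c("X", "Y", "Z", "W", "X"),
                  stringsAsFactors = FALSE)

rows <- which(ABC$A == 5)          # which() drops the NA comparison
cols <- setdiff(names(ABC), "A")   # every column except A, by name

ABC[rows, cols] <- NA
ABC
```

Selecting by name keeps the code working if A is not the first column.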
Some other variations
ABC[-1] <- lapply(ABC[-1], function(x) replace(x, ABC$A == 5, NA))
and using dplyr
library(dplyr)
ABC %>% mutate_at(-1, ~replace(., A == 5, NA))
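na_if on its own does not fit here: mutate_at(-1, list(~ na_if(A, y = 5))) would overwrite B and C with the values of A (with 5 turned into NA). With current dplyr, across() replaces mutate_at:
library(dplyr)
ABC %>%
mutate(across(-A, ~ replace(.x, A == 5, NA)))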

Unequal rows in list from unstack() - how to create a dataframe

I am trying to do a robust ANOVA analysis in R. This requires that my two variables are in a very specific format. Basically, the requirement is to unstack two columns in my current data frame and form an outcome frequency data frame based on the predictor (a categorical variable). This would usually happen automatically using the unstack() function, i.e.
newDataFrame <- unstack(oldDataFrame, scores ~ columns)
However, the list returned has unequal rows for each category. Here is an example:
$A
[1] 2 4 2 3 3
$B
[1] 3 3
$C
[1] 5
$D
[1] 4 4 3
A, B, C and D are my categories, and the numbers are the outcome. The outcome has to be 1, 2, 3, 4, 5 or 6.
What I am working towards is the category as the 'header' and the outcome as a reference column, with the frequencies as the other columns, such that the dataframe looks like this:
A B C D
1 NA NA NA NA
2 2 NA NA NA
3 2 2 NA 1
4 1 NA NA 2
5 NA NA 1 NA
6 NA NA NA NA
What I have tried:
On another SO post, I found this -
library(stringi)
res <- as.data.frame(t(stri_list2matrix(myUnstackedList)))
colnames(res) <- unique(unlist(sapply(myUnstackedList, names)))
Outcome:
res
1 2 4 2 3 3
2 3 3 <NA> <NA> <NA>
3 5 <NA> <NA> <NA> <NA>
4 4 4 3 <NA> <NA>
Note that the categories A, B, C, D have been changed to 1, 2, 3, 4
Also tried this (another SO post):
df <- as.data.frame(plyr::ldply(myUnstackedList, rbind))
Outcome:
df
outcome group score
2 A 2
3 A 2
4 A 1
3 B 2
etc
Any tips?
This gets you most of the way to your answer:
library(dplyr)
library(tidyr)
test <- list(A=c(2,4,2,3,3),
B=c(3,3),
C=c(5),
D=c(4,4,3))
test <- lapply(1:length(test), function(i){
x <- data.frame(names(test)[i], test[i],
stringsAsFactors=FALSE)
names(x) <- c("ID", "Value")
x})
test <- bind_rows(test) %>% table %>% as.data.frame
test <- spread(test, key=ID, value=Freq)
replace(test, test==0, NA)
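A base R sketch of the same counting idea, which also keeps the outcomes that never occur (1 and 6). It assumes, as stated in the question, that the possible outcomes are exactly 1 to 6. The names long, freq and res are illustrative.

```r
test <- list(A = c(2, 4, 2, 3, 3), B = c(3, 3), C = 5, D = c(4, 4, 3))

# long format: one row per observation, columns 'values' and 'ind'
long <- stack(test)

# fix the set of possible outcomes so empty levels still get a row
long$values <- factor(long$values, levels = 1:6)

# outcome x category frequency table, converted to a data frame
freq <- table(long$values, long$ind)
res <- as.data.frame.matrix(freq)

# show zero counts as NA, as in the desired output
res[res == 0] <- NA
res
```

The row names of res are the outcomes 1 to 6 and the columns are the categories A to D.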
I'm not sure what the issue was with your previous dplyr attempt, however, I offer
library(tidyr)
library(dplyr)
df <- tibble(
outcome = c(1:5, 1:2, 1, 1:3),
group = c(rep("A", 5), rep("B", 2), "C", rep("D", 3)),
score = c(2, 4, 2, 3, 3, 3, 3, 5, 4, 4, 3)
)
df %>%
group_by(outcome) %>%
spread(group, score) %>%
ungroup() %>%
select(-outcome)
# # A tibble: 5 x 4
# A B C D
# * <dbl> <dbl> <dbl> <dbl>
# 1 2 3 5 4
# 2 4 3 NA 4
# 3 2 NA NA 3
# 4 3 NA NA NA
# 5 3 NA NA NA

left_join(x,y) and NA

After seeing this post with a nice answer by @akrun, I wanted to play with dplyr. Here are the sample data from the post and from akrun.
df = data.frame(
id1 = c(1,1,2,2,2,3,3,3,3),
id2 = c(1,2,1,2,3,1,2,3,4),
X1 = letters[1:9],
X2 = LETTERS[1:9],
stringsAsFactors = FALSE
)
df2 <- data.frame(
id1 = rep(c(1:3), each = 4),
id2 = rep(c(1:4), times = 3),
stringsAsFactors = FALSE
)
If I replicate akrun's answer, merge() perfectly works here.
df %>%
do(merge(., df2, by = c("id1","id2"), all = TRUE))
id1 id2 X1 X2
1 1 1 a A
2 1 2 b B
3 1 3 <NA> <NA>
4 1 4 <NA> <NA>
5 2 1 c C
6 2 2 d D
7 2 3 e E
8 2 4 <NA> <NA>
9 3 1 f F
10 3 2 g G
11 3 3 h H
12 3 4 i I
Then, I thought left_join(x,y) would do. left_join(x,y) includes all of x, and matching rows of y. From the examples in the dplyr tutorial pdf from UseR!2014, I expected an identical result. But, that was not the case.
> df %>%
+ left_join(df2, .)
Joining by: c("id1", "id2")
id1 id2 X1 X2
1 1 1 a A
2 1 2 b B
3 1 3 <NA> <NA>
4 1 4 <NA> <NA>
5 2 1 <NA> <NA>
6 2 2 <NA> <NA>
7 2 3 <NA> <NA>
8 2 4 <NA> <NA>
9 3 1 <NA> <NA>
10 3 2 <NA> <NA>
11 3 3 <NA> <NA>
12 3 4 <NA> <NA>
The first three rows indicate that dplyr was doing the right job. But, once it encountered NA, it generated NAs till the end. Is this a bug or did I do something wrong? Thank you for taking your time.
There are currently a few bugs with dplyr and the _join functions:
https://github.com/hadley/dplyr/issues/542
https://github.com/hadley/dplyr/issues/455
https://github.com/hadley/dplyr/issues/450
It looks like they are being fixed. In the meantime, if you make sure the join variables are the same type (they aren't in your example, as you can tell by using str()), then it should work:
df = data.frame(
id1 = c(1,1,2,2,2,3,3,3,3),
id2 = c(1,2,1,2,3,1,2,3,4),
X1 = letters[1:9],
X2 = LETTERS[1:9],
stringsAsFactors = FALSE
)
df2 <- data.frame(
id1 = as.numeric(rep(c(1:3), each = 4)),
id2 = as.numeric(rep(c(1:4), times = 3)),
stringsAsFactors = FALSE
)
left_join(df2, df)
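To see the type mismatch without reading the full str() output, the classes of the key columns can be compared directly. This is a base R sketch on a shortened version of the data; keys, types_before and types_after are illustrative names, and the final join is left as a comment because it needs dplyr.

```r
df <- data.frame(id1 = c(1, 1, 2), id2 = c(1, 2, 1),
                 X1 = letters[1:3], stringsAsFactors = FALSE)
df2 <- data.frame(id1 = rep(1:3, each = 4), id2 = rep(1:4, times = 3))

keys <- c("id1", "id2")

# c(1, 1, 2) is numeric (double) while 1:3 is integer
types_before <- rbind(df  = sapply(df[keys], class),
                      df2 = sapply(df2[keys], class))
types_before

# convert the keys of df2 so both sides agree
df2[keys] <- lapply(df2[keys], as.numeric)
types_after <- rbind(df  = sapply(df[keys], class),
                     df2 = sapply(df2[keys], class))
types_after

# with dplyr loaded, the join would then be:
# left_join(df2, df, by = keys)
```

Passing by explicitly also avoids relying on the "Joining by" guess.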
