Report all possible combinations of a string-separated vector - r

In the tidyverse I would like to mutate/expand a string vector so that all possible combinations of elements (separated by " & ") are reported, one for each line.
I tried decomposing my function using t(combn(unlist(strsplit(x, " & ")),2)), but it fails when there is no " & ".
In the example:
"A" remains "A" (or becomes "A & A")
"A & B" remains "A & B"
"C & D & E" becomes "C & D", "C & E" and "D & E" in three different rows
Note (1): I cannot predict the number of combinations in advance "A & B & C & D..."
Note (2): Order is not important (i.e. "C & D" == "D & C")
Note (3): This would feed into a separate function and be used in an igraph application.
Thanks in advance.
data <- data.frame(names=c(1:3), combinations=c("A","A & B","C & D & E"))
names combinations
1 1 A
2 2 A & B
3 3 C & D & E
expected <- data.frame(projects=c(1,2,3,3,3), combinations=c("A","A & B","C & D","C & E","D & E"))
projects combinations
1 1 A
2 2 A & B
3 3 C & D
4 3 C & E
5 3 D & E

You can use combn to create combinations within each name :
library(dplyr)
library(tidyr)
data %>%
separate_rows(combinations, sep = ' & ') %>%
group_by(names) %>%
summarise(combinations = if(n() > 1)
combn(combinations, 2, paste0, collapse = ' & ') else combinations) %>%
ungroup
# names combinations
# <int> <chr>
#1 1 A
#2 2 A & B
#3 3 C & D
#4 3 C & E
#5 3 D & E
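For comparison, here is a minimal base R sketch of the same idea, with no packages. The helper name pair_combinations is made up for illustration and is not part of any library. It guards the single-element case explicitly, since combn() cannot pick 2 items out of 1.

```r
# Hypothetical helper: expand one " & "-separated string into all unordered pairs.
pair_combinations <- function(x, sep = " & ") {
  parts <- strsplit(x, sep, fixed = TRUE)[[1]]
  # combn() needs at least two elements, so a lone element is returned unchanged
  if (length(parts) < 2) return(parts)
  combn(parts, 2, FUN = paste, collapse = sep)
}

data <- data.frame(names = 1:3,
                   combinations = c("A", "A & B", "C & D & E"),
                   stringsAsFactors = FALSE)

# One character vector of pairs per input row
pairs <- lapply(data$combinations, pair_combinations)

# Repeat each id once per pair it produced
result <- data.frame(
  names = rep(data$names, lengths(pairs)),
  combinations = unlist(pairs),
  stringsAsFactors = FALSE
)
result
```

Because combn() never emits both "C & D" and "D & C", the order-insensitivity from Note (2) comes for free.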

A data.table option
library(data.table)
setnames(
setDT(data)[
,
{
s <- unlist(strsplit(combinations, " & "))
if (length(s) == 1) s else combn(s, 2, paste0, collapse = " & ")
},
names
], "V1", "combinations"
)[]
gives
names combinations
1: 1 A
2: 2 A & B
3: 3 C & D
4: 3 C & E
5: 3 D & E

Using data.table methods with cSplit from splitstackshape
library(data.table)
library(splitstackshape)
setnames(cSplit(data, 'combinations', sep=' & ', 'long', type.convert = FALSE)[,
if(.N > 1) combn(combinations, 2, FUN = paste, collapse = ' & ') else
combinations, names], 'V1', 'combinations')[]
# names combinations
#1: 1 A
#2: 2 A & B
#3: 3 C & D
#4: 3 C & E
#5: 3 D & E

Related

I have a sample dataset , which has missing values in it

I have a sample dataset which has missing values in it. I want to create a new column with a message, covering the different combinations, that tells which columns' values are missing.
Example:
Dataset:
A  B  C  D
1     2  4
   4  4
4  1
3     2  3
The permutations of the above data set are:
"a", "b", "c", "d", "a, b", "a, c", "a, d", "b, c", "b, d", "c, d", "a, b, c", "a, b, d", "a, c, d", "b, c, d", "a, b, c, d"
Result:
A  B  C  D  Message
1     2  4  Column B is missing
   4  4     Column A and D is missing
4  1        Column C and D is missing
            All column values are missing
3     2  3  Column B is missing
Any suggestion would be really appreciated
Here's a way using apply from base R -
set.seed(4)
df <- data.frame(matrix(sample(c(1:5, NA), 15, replace = T), ncol = 3))
names(df) <- LETTERS[1:3]
df$msg <- apply(df, 1, function(x) {
if(anyNA(x)) {
paste0(paste0(names(x)[which(is.na(x))], collapse = " "), " missing", collapse = "")
} else {
"No missing"
}
})
df
A B C msg
1 4 2 5 No missing
2 1 5 2 No missing
3 2 NA 1 B missing
4 2 NA NA B C missing
5 5 1 3 No missing
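If you want the wording from the question ("Column A and D is missing", "All column values are missing"), the same apply approach can be adapted. This is only a sketch: missing_message is a hypothetical helper name, and the data frame below is my reading of the question's example, with NA standing in for the blank cells.

```r
# Hypothetical helper: describe which columns are NA in one row.
missing_message <- function(row) {
  missing <- names(row)[is.na(row)]
  if (length(missing) == 0) return("No missing values")
  if (length(missing) == length(row)) return("All column values are missing")
  paste("Column", paste(missing, collapse = " and "), "is missing")
}

# Assumed reconstruction of the question's data (NA = blank cell)
df <- data.frame(A = c(1, NA, 4, NA, 3),
                 B = c(NA, 4, 1, NA, NA),
                 C = c(2, 4, NA, NA, 2),
                 D = c(4, NA, NA, NA, 3))

# Apply the helper row by row; each row arrives as a named vector
df$Message <- apply(df, 1, missing_message)
df
```

There is no need to enumerate the combinations of column names up front; the message is built from whichever columns are NA in each row.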

subsets of different vectors R

I have three vectors as shown below.
q = c("a == 1", "a == 2", "a == 3")
w = c("b >= 50", "b >= 100")
t = c("c >= 40 & c <= 80", "c > 80")
I want to combine the vectors into one large vector that contains every possible subset. For example I want to have
("a == 1 & b >= 50", "a == 1 & b >= 100", "a ==2 & b >=50",
"a == 2 & b >= 100", "a == 3 & b >= 50", "a == 3 & b >= 100",
"a ==1 & c >= 40 & c <= 80", "a ==1 & c > 80",
"a ==2 & c >= 40 & c <= 80", "a ==2 & c > 80",
"a ==3 & c >= 40 & c <= 80", "a ==3 & c > 80",
"b >= 50 & c >= 40 & c <= 80", "b >= 50 & c > 80",
"b >= 100 & c >= 40 & c <= 80", "b >= 100 & c > 80",
"a == 1 & b >= 50 & c >= 40 & c <= 80", "a == 1 & b >=50 & c > 80",
"a == 2 & b >= 50 & c >= 40 & c <= 80", "a == 2 & b >=50 & c > 80",
"a == 3 & b >= 50 & c >= 40 & c <= 80", "a == 3 & b >=50 & c > 80",
"a == 2 & b >= 100 & c >= 40 & c <= 80", "a == 2 & b >=100 & c > 80",
"a == 3 & b >= 100 & c >= 40 & c <= 80", "a == 3 & b >=100 & c > 80")
So I need every subset to be created and joined with the "&" sign but I don't want to be comparing any element in the same vector. I also have three vectors in this example but the number of vectors should be variable. Does anyone know how to achieve this? Thanks!
We can create strings using expand.grid and combn. Create a combn of list ('lst') elements picking 2 or 3 in a list (using lapply), expand the list elements into a data.frame and paste with do.call (specifying the sep as " & ")
lst <- list(q, w, t)
unlist( lapply(2:3, function(i) combn(lst, i,
FUN = function(x) do.call(paste, c(expand.grid(x), sep = " & ")),
simplify = FALSE)))
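Since the question asks for a variable number of vectors, the hard-coded 2:3 can be derived from the list length. The following is a sketch of that generalisation, where combine_conditions is a made-up wrapper name. It uses the same combn plus expand.grid approach and only adds stringsAsFactors = FALSE to keep everything as character.

```r
q <- c("a == 1", "a == 2", "a == 3")
w <- c("b >= 50", "b >= 100")
t <- c("c >= 40 & c <= 80", "c > 80")
lst <- list(q, w, t)

# Hypothetical wrapper: for every choice of 2..length(lst) vectors,
# cross-combine their elements and paste them with `sep`.
combine_conditions <- function(lst, sep = " & ") {
  unlist(lapply(seq_along(lst)[-1], function(i) {
    combn(lst, i, simplify = FALSE, FUN = function(x) {
      do.call(paste, c(expand.grid(x, stringsAsFactors = FALSE), sep = sep))
    })
  }))
}

res <- combine_conditions(lst)
length(res)  # 16 pairwise strings plus 12 three-way strings
```

Elements from the same vector are never paired with each other, because expand.grid only crosses different list entries.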

Pairing truncated character into a dataframe

I have a chr[1:10] truncated data and each line is organized in such way and some rows don't have the same length:
[1] "\nA B C D E"
[2] "\n1 3 4 5"
[3] "\nF G H"
[4] "\n6 7 8"
Here's an updated version of my question
line.1 <- c("A B C D E")
line.2 <- c("1 3 4 5")
line <- rbind(line.1, line.2)
line <- data.frame(line)
line
line.1 A B C D E
line.2 1 3 4 5
So, my desired output should be:
V1 V2 V3 V4 V5
Line.1 A B C D E
Line.2 1 3 4 5
I can't quite figure out how to split it into different columns with the extra space in between being counted as one value.
Here's one way to do it:
# Build the character vector (two spaces after "1" give the empty field shown in the output)
x <- c("\nA B C D E", "\n1  3 4 5", "\nF G H", "\n6 7 8")
# Remove the new line characters
x <- sub("\n", "", x)
# Select every other element of the character vector as column 1
Col1 <- paste(x[c(T, F)], collapse = ' ')
Col1 <- strsplit(Col1, ' ')[[1]]
# Do the same for column 2
Col2 <- paste(x[c(F, T)], collapse = ' ')
Col2 <- strsplit(Col2, ' ')[[1]]
# Combine them in a data frame
data.frame(Col1, Col2)
# Col1 Col2
# 1 A 1
# 2 B
# 3 C 3
# 4 D 4
# 5 E 5
# 6 F 6
# 7 G 7
# 8 H 8
The use of strsplit is what splits the values into different columns:
> strsplit(line.2, ' ')[[1]]
[1] "1" "" " 3" "4" "5"
So to combine both lines as a dataframe, you can do:
data.frame(rbind(strsplit(line.1, ' ')[[1]], strsplit(line.2, ' ')[[1]]))
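To make the role of the extra space explicit, here is a small sketch. It assumes line.2 really contains two consecutive spaces after the "1" (the exact spacing is my assumption). Splitting on a single literal space keeps the empty field, while splitting on the regex " +" would drop it. Rows are padded with NA so that rbind does not recycle values when the lengths differ.

```r
line.1 <- "A B C D E"
line.2 <- "1  3 4 5"   # assumed: two spaces after "1", i.e. one empty field

# Split on a single literal space so the empty field is preserved
fields <- strsplit(c(line.1, line.2), " ", fixed = TRUE)

# Pad shorter rows with NA so every row has the same number of columns
width <- max(lengths(fields))
padded <- lapply(fields, function(f) c(f, rep(NA_character_, width - length(f))))

out <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
out

# For contrast: collapsing runs of spaces loses the empty column
strsplit(line.2, " +")[[1]]
```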

how to replace values by comparing two columns in r

I have a dataframe that looks like:
df<-read.table(text="ID RE AL
140343 TC T
200012 A G
457096 GAA GAAA
555084 AG A
557151 T TAA
752311 GAATTAAT GAAT
810001 ATTTTT ATTTT
880420 GAAAAAAAAA GAAAAAAAAAA", header=TRUE, colClasses="character")
I would like to replace the longer string in column "RE" or "AL" with the letter "I", and the shorter one with the letter "D". If both columns have one letter, nothing should change.
the expected result:
ID RE AL
140343 I D
200012 A G
457096 D I
555084 I D
557151 D I
752311 I D
810001 I D
880420 D I
I tried my script as:
max <- apply(df[2:3], 1, function(x) max(nchar(x)))
index <- max > 1
if(nchar(df$RE[index])==max[index]){
df$RE[index] <- "I"
df$AL[index] <- "D"
}else{
df$RE[index] <- "D"
df$AL[index] <- "I"
}
A base R vectorized solution. First line defines a subset of rows to work on. Then two lines with opposite directions for the comparison lets you choose either "D" or "I" based on the comparisons:
noneq <- with( df, (nchar(RE) != 1)|( nchar(AL) != 1) )
df[ noneq, "RE"] <- with(df[ noneq, ], c("D","I")[1+(nchar(RE) > nchar(AL) )])
df[ noneq, "AL"] <- with(df[ noneq, ], c("D","I")[1+(RE=="D" )]) # opposite of RE
df
#==============
ID RE AL
1 140343 I D
2 200012 A G
3 457096 D I
4 555084 I D
5 557151 D I
6 752311 I D
7 810001 I D
8 880420 D I
Here is a dplyr solution that may work for you
library(dplyr)
df %>%
mutate(RE = ifelse(nchar(RE) != 1 | nchar(AL) != 1,
ifelse(nchar(RE) > nchar(AL), 'I', 'D'), RE),
AL = ifelse(RE=='I', 'D', ifelse(RE=='D', 'I', AL)))
## ID RE AL
## 1 140343 I D
## 2 200012 A G
## 3 457096 D I
## 4 555084 I D
## 5 557151 D I
## 6 752311 I D
## 7 810001 I D
## 8 880420 D I
Here is a simple for loop that gets the job done:
for (i in seq_len(nrow(df))){
if(nchar(df[i, 3]) - nchar(df[i, 2]) < 0){
df[i, 3] <- "D"
df[i, 2] <- "I"
}else if(nchar(df[i, 3]) - nchar(df[i, 2]) > 0){
df[i, 3] <- "I"
df[i, 2] <- "D"
}
}
An alternative base R solution (comparable to @42-'s answer, but with the indexes defined up front):
# create needed indexes
idx1 <- !(nchar(df$RE) == 1 & nchar(df$AL) == 1)
idx2 <- (nchar(df$RE) > nchar(df$AL)) + 1L
idx3 <- (nchar(df$RE) < nchar(df$AL)) + 1L
# replace the values
df$RE[idx1] <- c('D','I')[idx2][idx1]
df$AL[idx1] <- c('D','I')[idx3][idx1]
which gives:
> df
ID RE AL
1 140343 I D
2 200012 A G
3 457096 D I
4 555084 I D
5 557151 D I
6 752311 I D
7 810001 I D
8 880420 D I
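The answers above all reduce to comparing nchar(RE) with nchar(AL). As a sketch, that comparison can be wrapped in a small function. The name label_indels is hypothetical, and the three-row data frame is just the first rows of the question's data. One design choice to be aware of: rows where both strings have the same length are left untouched here, as in the for loop answer, even if they are longer than one character.

```r
# First three rows of the question's data
df <- data.frame(ID = c("140343", "200012", "457096"),
                 RE = c("TC", "A", "GAA"),
                 AL = c("T", "G", "GAAA"),
                 stringsAsFactors = FALSE)

# Hypothetical helper: longer allele becomes "I", shorter becomes "D".
label_indels <- function(df) {
  # Compute the length difference once, before any value is overwritten
  d <- nchar(df$RE) - nchar(df$AL)
  df$RE[d > 0] <- "I"; df$AL[d > 0] <- "D"
  df$RE[d < 0] <- "D"; df$AL[d < 0] <- "I"
  df
}

out <- label_indels(df)
out
```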

split and table

I have a data frame like this:
GN SN
a b
a b
a c
d e
d f
d e
I would like the following output:
GN: a SN: 2 b 1 c
GN: d SN: 2 e 1 f
In other words, I would like a sort of table() of the data.frame on the column SN. First I split the data.frame according to $GN, so I have blocks. At this point I am not able to count the elements of column SN within each block. Is the "apply" function a way to do this? And how can I save the general output that comes from the split function?
Thanks in advance
With your data:
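df <- data.frame(GN = rep(c("a","b"), each = 3),
SN = c(rep("b", 2), "c", "e", "f", "e"),
stringsAsFactors = TRUE) # factors are needed for droplevels() below; the default is FALSE since R 4.0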
We could do:
> lapply(with(df, split(SN, GN)), table)
$a
b c e f
2 1 0 0
$b
b c e f
0 0 2 1
But if you don't want all the levels (the 0 entries) then we need to drop the empty levels:
> lapply(with(df, split(SN, GN)), function(x) table(droplevels(x)))
$a
b c
2 1
$b
e f
2 1
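If a long format is easier to work with than a list of tables, one alternative sketch is to cross-tabulate both columns at once and keep only the non-zero counts. This uses the GN values from the question ("a" and "d") rather than the ones in the answer above. The column names in the result come from as.data.frame() on a table.

```r
df <- data.frame(GN = rep(c("a", "d"), each = 3),
                 SN = c("b", "b", "c", "e", "f", "e"),
                 stringsAsFactors = FALSE)

# Cross-tabulate GN by SN, then flatten to one row per (GN, SN) pair
counts <- as.data.frame(table(GN = df$GN, SN = df$SN),
                        stringsAsFactors = FALSE)

# Drop pairs that never occur and sort within each group by frequency
counts <- counts[counts$Freq > 0, ]
counts <- counts[order(counts$GN, -counts$Freq), ]
counts
```

The result has one row per observed GN/SN pair, which is straightforward to write out with write.table().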
Writing out the individual tables to a file
This isn't perfect but at least you can work with it
## save tables
tmp <- lapply(with(df, split(SN, GN)), function(x) table(droplevels(x)))
## function to write output to file `fname`
foo <- function(x, fname) {
cat(paste(names(x), collapse = " "), "\n", file = fname, append = TRUE)
cat(paste(x, collapse = " "), "\n", file = fname, append = TRUE)
invisible()
}
fname <- "foo.txt"
file.create(fname) # create file fname
lapply(tmp, foo, fname = fname) # run our function to write to fname
That gives:
R> readLines(fname)
[1] "b c " "2 1 " "e f " "2 1 "
or from the OS:
$ cat foo.txt
b c
2 1
e f
2 1
