My string is "NGNG", where N stands for any of c("A", "T", "C", "G"), so the output should contain all 16 combinations: AGAG, TGAG, CGAG, GGAG, TGTG, TGCG, TGGG, and so on.
If only a single position at the start changes, for example "NGG", I can easily do it with expand_grid from tidyr:
library(tidyverse)
expand_grid(one = c("A", "T", "C", "G"), two = "NG") %>%
mutate(three = paste0(one, two)) %>%
pull(three)
[1] "ANG" "TNG" "CNG" "GNG"
But I'm struggling to find a way to do this when N appears in the middle or more than once.
How about expand.grid followed by do.call?
cart_prod <- expand.grid(c("A", "T", "C", "G"),
"G",
c("A", "T", "C", "G"),
"G")
do.call(paste0, cart_prod)
[1] "AGAG" "TGAG" "CGAG" "GGAG" "AGTG" "TGTG" "CGTG" "GGTG"
[9] "AGCG" "TGCG" "CGCG" "GGCG" "AGGG" "TGGG" "CGGG" "GGGG"
Explanation
Since the OP requested that positions 2 and 4 remain "G", we simply let the 1st and 3rd arguments vary over the possible choices: c("A", "T", "C", "G"). Now, calling expand.grid with these four arguments:
c("A", "T", "C", "G")
"G"
c("A", "T", "C", "G")
"G"
will produce a data.frame that is isomorphic to our desired result, since expand.grid returns the Cartesian product.
expand.grid(c("A", "T", "C", "G"),
"G",
c("A", "T", "C", "G"),
"G")
Var1 Var2 Var3 Var4
1 A G A G
2 T G A G
3 C G A G
4 G G A G
5 A G T G
6 T G T G
7 C G T G
8 G G T G
9 A G C G
10 T G C G
11 C G C G
12 G G C G
13 A G G G
14 T G G G
15 C G G G
16 G G G G
Now, all that is left is smashing the columns together. We make use of do.call and paste0 to achieve this.
Why does do.call(paste0, some_data.frame) Work?
I found this great explanation on do.call here: The {do.call} function. Here is the first line:
"R has an interesting function called do.call. This function allows you to call any R function, but instead of writing out the arguments one by one, you can use a list to hold the arguments of the function."
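Since a data.frame is essentially a list under the hood, we can use do.call in the usual way: each column is passed to paste0 as a separate argument.
Each column of cart_prod is a factor here (expand.grid defaults to stringsAsFactors = TRUE); paste0 converts factors to their character labels and combines the columns element-wise. For example, the first and second columns are: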
cart_prod$Var1
[1] A T C G A T C G A T C G A T C G
Levels: A T C G
cart_prod$Var2
[1] G G G G G G G G G G G G G G G G
Levels: G
Applying paste0 to these two, gives:
paste0(cart_prod$Var1, cart_prod$Var2)
[1] "AG" "TG" "CG" "GG" "AG" "TG" "CG" "GG"
[9] "AG" "TG" "CG" "GG" "AG" "TG" "CG" "GG"
As you can see, we are starting to see our desired result come together. If we were to combine this result with the third column, we would obtain:
paste0(paste0(cart_prod$Var1, cart_prod$Var2), cart_prod$Var3)
[1] "AGA" "TGA" "CGA" "GGA" "AGT" "TGT" "CGT" "GGT"
[9] "AGC" "TGC" "CGC" "GGC" "AGG" "TGG" "CGG" "GGG"
And now, we combine this result with the last column:
paste0(paste0(paste0(cart_prod$Var1, cart_prod$Var2), cart_prod$Var3), cart_prod$Var4)
[1] "AGAG" "TGAG" "CGAG" "GGAG" "AGTG" "TGTG" "CGTG" "GGTG"
[9] "AGCG" "TGCG" "CGCG" "GGCG" "AGGG" "TGGG" "CGGG" "GGGG"
Voila! We have our desired result.
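The same expand.grid plus do.call idea can be wrapped in a small helper so that N may appear anywhere and any number of times. This is only a sketch: the function name expand_n and its alphabet argument are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (name and arguments are illustrative assumptions).
# Expands every "N" in `pattern` over `alphabet`; other characters stay fixed.
expand_n <- function(pattern, alphabet = c("A", "T", "C", "G")) {
  # Split the pattern into single characters, e.g. "N" "G" "N" "G"
  chars <- strsplit(pattern, "")[[1]]
  # One vector of candidates per position: the alphabet for "N", else the letter
  choices <- lapply(chars, function(ch) if (ch == "N") alphabet else ch)
  # Cartesian product of all positions (a list argument is accepted directly)
  grid <- expand.grid(choices, stringsAsFactors = FALSE)
  # Paste the columns together element-wise
  do.call(paste0, unname(grid))
}

res <- expand_n("NGNG")
length(res)
head(res, 4)
```

Because the first position varies fastest in expand.grid, the ordering matches the output shown above.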
Here is a weird approach on how to achieve your desired output:
Here are a few notes on this solution:
I wrapped the map2 call in curly braces so that I can choose .x and .y myself; otherwise %>% would pass the LHS (here a data frame) as the first argument.
exec calls a function with a list of arguments, much like do.call in base R; !!! splices the elements of the resulting list so that each one becomes a separate argument, and rbind then binds them by rows.
library(purrr)
N <- c("A", "T", "C", "G")
expand.grid(N, N) %>%
{map2(.$Var1, .$Var2, ~ paste0(.x, "G", .y, "G"))} %>%
exec(rbind, !!!.)
[,1]
[1,] "AGAG"
[2,] "TGAG"
[3,] "CGAG"
[4,] "GGAG"
[5,] "AGTG"
[6,] "TGTG"
[7,] "CGTG"
[8,] "GGTG"
[9,] "AGCG"
[10,] "TGCG"
[11,] "CGCG"
[12,] "GGCG"
[13,] "AGGG"
[14,] "TGGG"
[15,] "CGGG"
[16,] "GGGG"
Related
How do I pull out unique values from each column in a data frame (both numeric and strings) and make into one column?
a = c("a", "b", "c", "d", "a")
b = c(1, 2, 3, 4, 3)
df <- cbind(a, b)
The preferred output would be:
variable Level
a a
a b
a c
a d
b 1
b 2
b 3
b 4
The sample data above is simple but the intent is to be able to use the answer for multiple data frame with different column names and data in them. Thank you.
Quick + scalable
Tidyr's gather and dplyr's distinct gives you a quick way to get that structure. (I left the package calls in the functions so you can remember which one is from which package, which I always forget.)
library(tidyverse)
a = c("a", "b", "c", "d", "a")
b = c(1, 2, 3, 4, 3)
data.frame(a,b) %>% tidyr::gather() %>% dplyr::distinct()
key value
1 a a
2 a b
3 a c
4 a d
5 b 1
6 b 2
7 b 3
8 b 4
We place it in a list, get the unique elements, set the names with letters and then stack to data.frame
d1 <- stack(setNames(lapply(list(a, b), unique), letters[1:2]))[2:1]
colnames(d1) <- c('variable', 'Level')
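Since the question asks for something that works on data frames with arbitrary column names, the stack idea can be applied to the columns directly. A minimal sketch, assuming the input is a real data.frame (not the cbind matrix from the question); the helper name unique_levels is hypothetical:

```r
# Hypothetical helper: unique values of every column, stacked into two columns.
unique_levels <- function(df) {
  # Unique values per column, as character so mixed types can share one column
  per_col <- lapply(df, function(col) as.character(unique(col)))
  # stack() returns columns (values, ind); [2:1] puts the column name first
  out <- stack(per_col)[2:1]
  names(out) <- c("variable", "Level")
  out
}

df <- data.frame(a = c("a", "b", "c", "d", "a"),
                 b = c(1, 2, 3, 4, 3),
                 stringsAsFactors = FALSE)
res <- unique_levels(df)
res
```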
Creating df (note that cbind yields a character matrix, not a data.frame):
a = c("a", "b", "c", "d", "a")
b = c(1, 2, 3, 4, 3)
df <- cbind(a, b)
Column name extraction
names<-colnames(df)
Data extraction
variable<-NULL
Level<-NULL
for(i in 1:length(names))
{
variable<-c(variable,rep(names[i],length(unique(df[,i]))))
Level<-c(Level,unique(df[,i]))
}
Your generic output
db<-cbind(variable,Level)
db
variable Level
[1,] "a" "a"
[2,] "a" "b"
[3,] "a" "c"
[4,] "a" "d"
[5,] "b" "1"
[6,] "b" "2"
[7,] "b" "3"
[8,] "b" "4"
I'm asking how to merge two lists in parallel (element by element), rather than appending one after the other.
For example,
A <- list(c(1,2,3), c(3,4,5), c(6,7,8))
B <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))
The expected result is:
[[1]]
[[1]][[1]]
[1] 1 2 3
[[1]][[2]]
[1] "a" "b" "c"
[[2]]
[[2]][[1]]
[1] 3 4 5
[[2]][[2]]
[1] "d" "e" "f"
[[3]]
[[3]][[1]]
[1] 6 7 8
[[3]][[2]]
[1] "g" "h" "i"
Using Map simply:
Map(list,A,B)
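A quick way to see what Map does here: it calls list() on the first elements of A and B, then on the second elements, and so on. The sketch below just checks the structure; the variable name merged is illustrative.

```r
A <- list(c(1, 2, 3), c(3, 4, 5), c(6, 7, 8))
B <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))

# Pair up the i-th element of A with the i-th element of B
merged <- Map(list, A, B)

length(merged)     # one entry per position
merged[[2]][[1]]   # second element of A
merged[[2]][[2]]   # second element of B

# Map is a thin wrapper around mapply with SIMPLIFY = FALSE
same <- identical(merged, mapply(list, A, B, SIMPLIFY = FALSE))
```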
A longer approach (not recursive yet, merging only up to the second level):
A <- list(c(1,2,3), c(3,4,5), c(6,7,8))
B <- list(c("a", "b", "c"), c("d", "e", "f"), c("g", "h", "i"))
mergepar <- function(x = A, y = B) { # merge two lists in parallel
  ln <- max(length(x), length(y)) # max length
  newlist <- vector("list", ln) # empty list of max length
  for (i in seq_len(ln)) { # loop over positions
    # two-level subsetting (first [ and then [[, so no subscript-out-of-bounds error);
    # note: uses the arguments x and y, not the globals A and B
    newlist[[i]] <- lapply(list(x, y), function(l) l[i][[1]])
  }
  newlist
}
mergepar(A, B)
I have a vector like:
c("A", "B", "C", "D", "E", "F")
and I'd like to create a dataframe like
"from" "to"
A B
B C
C D
D E
E F
how can I accomplish that?
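Another way (with vec holding the vector from the question):
vec <- c("A", "B", "C", "D", "E", "F")
data.frame(from = vec[-length(vec)], to = vec[-1])
or, using dplyr::lead:
na.omit(data.frame(from = vec, to = dplyr::lead(vec)))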
from to
1 A B
2 B C
3 C D
4 D E
5 E F
Another way is to use zoo package,
library(zoo)
rollapply(vec, 2, by = 1, paste)
Here is one method using embed and rearranging columns:
# data
temp <- c("A", "B", "C", "D", "E", "F")
embed(temp, 2)[, c(2,1)]
[,1] [,2]
[1,] "A" "B"
[2,] "B" "C"
[3,] "C" "D"
[4,] "D" "E"
[5,] "E" "F"
To put this into a data.frame, wrap it in data.frame and set the column names:
setNames(data.frame(embed(temp, 2)[, c(2,1)]), c("from", "to"))
from to
1 A B
2 B C
3 C D
4 D E
5 E F
We could also do:
vec <- c("A", "B", "C", "D", "E", "F")
x <- rep(seq(length(vec)), each=2)[-length(vec)*2][-1]
# [1] 1 2 2 3 3 4 4 5 5 6
data.frame(matrix(vec[x], ncol = 2, byrow = TRUE))
Or alternatively:
data.frame(t(sapply(seq(length(vec)-1), function(i) c(vec[i], vec[i+1]))))
# X1 X2
# 1 A B
# 2 B C
# 3 C D
# 4 D E
# 5 E F
I have a vector of objects (object) along with a corresponding vector of time frames (tframe) in which the objects were observed. For each unique pair of objects, I want to calculate the number of time frames in which both objects were observed.
I can write the code using for() loops, but it takes a long time to run as the number of unique objects increases. How might I change the code to speed up the run time?
Below is an example with 4 unique objects (in reality I have about 300). For example, objects a and c were both observed in time frames 1 and 2, so they get a count of 2. Objects b and d were never observed in the same time frame, so they get a count of 0.
object <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d")
tframe <- c(1, 1, 2, 2, 3, 1, 2, 2, 3, 1)
uo <- unique(object)
n <- length(uo)
mpairs <- matrix(NA, nrow=n*(n-1)/2, ncol=3, dimnames=list(NULL,
c("obj1", "obj2", "sametf")))
row <- 0
for(i in 1:(n-1)) {
for(j in (i+1):n) {
row <- row+1
mpairs[row, "obj1"] <- uo[i]
mpairs[row, "obj2"] <- uo[j]
# no. of time frames in which both objects in a pair were observed
intwin <- intersect(tframe[object==uo[i]], tframe[object==uo[j]])
mpairs[row, "sametf"] <- length(intwin)
}}
data.frame(object, tframe)
object tframe
1 a 1
2 a 1
3 a 2
4 b 2
5 b 3
6 c 1
7 c 2
8 c 2
9 c 3
10 d 1
mpairs
obj1 obj2 sametf
[1,] "a" "b" "1"
[2,] "a" "c" "2"
[3,] "a" "d" "1"
[4,] "b" "c" "2"
[5,] "b" "d" "0"
[6,] "c" "d" "1"
You can use a cross product (tcrossprod) to get the co-occurrence counts. You can then reshape the data, if required.
Example
object <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d")
tframe <- c(1, 1, 2, 2, 3, 1, 2, 2, 3, 1)
# This will give you the counts
# Use code from Jean's comment
tab <- tcrossprod(table(object, tframe)>0)
# Reshape the data
tab[lower.tri(tab, TRUE)] <- NA
reshape2::melt(tab, na.rm=TRUE)
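To see why the cross product gives the counts: table(object, tframe) > 0 is an incidence matrix (one row per object, one column per time frame), and multiplying it by its own transpose sums, for each pair of objects, the time frames in which both are present. The sketch below spells this out and builds the pair table in base R instead of reshape2; the names inc, counts and pairs_df are illustrative.

```r
object <- c("a", "a", "a", "b", "b", "c", "c", "c", "c", "d")
tframe <- c(1, 1, 2, 2, 3, 1, 2, 2, 3, 1)

# 0/1 incidence matrix: rows are objects, columns are time frames
inc <- 1 * (unclass(table(object, tframe)) > 0)

# inc %*% t(inc): entry [i, j] counts time frames shared by objects i and j
counts <- tcrossprod(inc)

# Keep each unordered pair once (upper triangle, diagonal excluded)
idx <- which(upper.tri(counts), arr.ind = TRUE)
pairs_df <- data.frame(
  obj1   = rownames(counts)[idx[, "row"]],
  obj2   = rownames(counts)[idx[, "col"]],
  sametf = counts[idx],
  stringsAsFactors = FALSE
)
pairs_df
```

The diagonal of counts holds the number of distinct time frames per object, which is why it is dropped before listing the pairs.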
I have some values in my data frames, #N/A, that I want to convert to NA. I'm trying what seems like a straightforward grepl via lapply on the data frame, but it's not working. Here's a simple example...
a = c("#N/A", "A", "B", "#N/A", "C")
b = c("d", "#N/A", "e", "f", "123")
df = as.data.frame(cbind(a,b))
lapply(df, function(x){x[grepl("#N/A", x)]=NA})
Which outputs:
$a
[1] NA
$b
[1] NA
Can someone point me in the right direction? I'd appreciate it.
Your function needs to return x as the return value.
Try:
lapply(df, function(x){x[grepl("#N/A", x)] <- NA; x})
$a
[1] <NA> A B <NA> C
Levels: #N/A A B C
$b
[1] d <NA> e f 123
Levels: #N/A 123 d e f
But you should really use gsub instead of grepl:
lapply(df, function(x)gsub("#N/A", NA, x))
$a
[1] NA "A" "B" NA "C"
$b
[1] "d" NA "e" "f" "123"
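A better (more flexible and possibly easier to maintain) solution might be a small helper. Convert to character first, because ifelse applied to a factor returns the integer codes rather than the labels:
na_if_match <- function(x, ptn="#N/A") ifelse(x %in% ptn, NA, as.character(x))
lapply(df, na_if_match)
$a
[1] NA "A" "B" NA "C"
$b
[1] "d" NA "e" "f" "123"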
You need to return x, and apply is an option in this case (note that it returns a character matrix rather than a data.frame). Creating a data.frame with cbind is best avoided as well.
a = c("#N/A", "A", "B", "#N/A", "C")
b = c("d", "#N/A", "e", "f", "123")
df = data.frame(a=a, b=b, stringsAsFactors = FALSE)
str(df)
apply(df, 2, function(x){x[grepl("#N/A", x)] <- NA; return(x)})
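If you are reading this data in from a CSV or tab-delimited file, just set na.strings = "#N/A". Use read.csv (or pass comment.char = "" to read.table), because read.table treats # as a comment character by default:
read.csv("my file.csv", na.strings = "#N/A")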
Update from comment: or maybe na.strings = c("#N/A", "#N/A#N/A").
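A small self-contained check of the na.strings approach, reading from an in-memory string instead of a file. The sample data in csv_text is made up for illustration. It uses read.csv, whose default comment.char is "", so the leading # is not treated as the start of a comment.

```r
# Made-up sample data standing in for the real file
csv_text <- "a,b\n#N/A,d\nA,#N/A\nB,e"

# Cells equal to "#N/A" become NA while the data is read
df_in <- read.csv(text = csv_text,
                  na.strings = "#N/A",
                  stringsAsFactors = FALSE)
df_in
is.na(df_in)
```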
Even if you are stuck with the case you described in your question, you still don't need grepl.
df <- data.frame(
a = c("#N/A", "A", "B", "#N/A", "C"),
b = c("d", "#N/A", "e", "f", "123")
)
df[] <- lapply(
df,
function(x)
{
x[x == "#N/A"] <- NA
x
}
)
df
## a b
## 1 <NA> d
## 2 A <NA>
## 3 B e
## 4 <NA> f
## 5 C 123
As per the example in your question, you don't need any kind of apply loop; just do
df[df == "#N/A"] <- NA
For cases where you have #N/A#N/A (although you didn't provide such data), another way to solve this would be
df[sapply(df, function(x) grepl("#N/A", x))] <- NA
In both cases the data itself is updated, rather than just printed to the console.
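The difference between the two variants only shows up on cells such as "#N/A#N/A". A minimal sketch with made-up data (the data frames raw, exact and partial are illustrative names):

```r
# Made-up data containing both "#N/A" and "#N/A#N/A"
raw <- data.frame(a = c("#N/A", "A", "#N/A#N/A"),
                  b = c("d", "#N/A", "e"),
                  stringsAsFactors = FALSE)

# Exact comparison: only cells equal to "#N/A" become NA
exact <- raw
exact[exact == "#N/A"] <- NA

# Pattern match: any cell containing "#N/A" becomes NA
partial <- raw
partial[sapply(partial, function(x) grepl("#N/A", x))] <- NA

exact
partial
```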