I need to create a list containing one vector for each gene. Each vector should be the result of applying func to that gene.
Hi, and welcome to Stack Overflow. Since you didn't provide any data to test with, I made some dummy data as follows. These are examples of the dummy genes:
valid_codons <- c("aaa", "aac", "aag", "aat", "aca", "acc", "acg", "act",
"aga", "agc", "agg", "agt", "ata", "atc", "atg", "att", "caa", "cac",
"cag", "cat", "cca", "ccc", "ccg", "cct", "cga", "cgc", "cgg", "cgt",
"cta", "ctc", "ctg", "ctt", "gaa", "gac", "gag", "gat", "gca", "gcc",
"gcg", "gct", "gga", "ggc", "ggg", "ggt", "gta", "gtc", "gtg", "gtt",
"taa", "tac", "tag", "tat", "tca", "tcc", "tcg", "tct", "tga", "tgc",
"tgg", "tgt", "tta", "ttc", "ttg", "ttt")
genes <- replicate(3800, {
paste0(sample(valid_codons, sample(5:20, 1), replace = TRUE), collapse = "")
})
print(head(genes, 3))
#> [1] "gggtacaaagtgcat"
#> [2] "cggaaaaccggggcgtgtccg"
#> [3] "ggaccactattactctcctcgggtatagatacccgaggt"
Judging from the function, I'm assuming that each gene is stored as a character vector with one element per base, which I made like this:
genes_chars <- strsplit(genes, "")
print(head(genes_chars, 2))
#> [[1]]
#> [1] "g" "g" "g" "t" "a" "c" "a" "a" "a" "g" "t" "g" "c" "a" "t"
#>
#> [[2]]
#> [1] "c" "g" "g" "a" "a" "a" "a" "c" "c" "g" "g" "g" "g" "c" "g" "t" "g" "t" "c"
#> [20] "c" "g"
Now, getting to your actual question: I'm wrapping the codon_count() function you provided in an lapply() call to calculate the result for every gene.
codon_count <- function(gene) {
  answer <- rep(0, 64)
  names(answer) <- valid_codons
  for (i in seq(from = 1, to = length(gene), by = 3)) {
    codon <- tolower(paste0(gene[i], gene[i + 1], gene[i + 2]))
    answer[codon] <- answer[codon] + 1
  }
  return(answer[valid_codons])
}
result <- lapply(genes_chars, codon_count)
print(head(result, 2))
#> [[1]]
#> aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat
#> 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct
#> 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt
#> 0 0 1 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
#> tta ttc ttg ttt
#> 0 0 0 0
#>
#> [[2]]
#> aaa aac aag aat aca acc acg act aga agc agg agt ata atc atg att caa cac cag cat
#> 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0
#> cca ccc ccg cct cga cgc cgg cgt cta ctc ctg ctt gaa gac gag gat gca gcc gcg gct
#> 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0
#> gga ggc ggg ggt gta gtc gtg gtt taa tac tag tat tca tcc tcg tct tga tgc tgg tgt
#> 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1
#> tta ttc ttg ttt
#> 0 0 0 0
We can check that the dimensions are correct with length() and lengths().
unique(lengths(result))
#> [1] 64
length(result)
#> [1] 3800
However, I think that the code below is somewhat more efficient.
# Split character vectors into groups of three
# Based on https://stackoverflow.com/questions/11619616/how-to-split-a-string-into-substrings-of-a-given-length
splitgenes <- strsplit(genes, "(?<=.{3})", perl = TRUE)
result2 <- t(vapply(splitgenes, function(gene) {
table(factor(gene, valid_codons))
}, numeric(length(valid_codons))))
# What are the result2 dimensions and content?
dim(result2)
#> [1] 3800 64
result2[1:5, 1:5]
#> aaa aac aag aat aca
#> [1,] 1 0 0 0 0
#> [2,] 1 0 0 0 0
#> [3,] 0 0 0 0 0
#> [4,] 0 0 0 0 0
#> [5,] 0 1 0 0 0
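If the look-behind regular expression feels opaque, the same split can be sketched with substring(). This is only an illustration: split_codons() and genes_demo are hypothetical names that are not part of the original answer, and the sketch assumes each gene's length is a multiple of three.

```r
# Hypothetical helper: cut one gene string into consecutive 3-letter codons.
split_codons <- function(gene) {
  if (nchar(gene) == 0) return(character(0))
  starts <- seq(1, nchar(gene), by = 3)   # start positions 1, 4, 7, ...
  substring(gene, starts, starts + 2)     # vectorised over the start positions
}

# Two of the dummy genes printed earlier
genes_demo <- c("gggtacaaagtgcat", "cggaaaaccggggcgtgtccg")
split_demo <- lapply(genes_demo, split_codons)
print(split_demo[[1]])
#> [1] "ggg" "tac" "aaa" "gtg" "cat"
```

The resulting list can be passed to the same table(factor(...)) step as splitgenes.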
EDIT:
This is the for-loop equivalent of the lapply statement:
result <- list()
for (i in seq_along(genes_chars)) {
result[[i]] <- codon_count(genes_chars[[i]])
}
Note however that this is less efficient.
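If you do prefer the explicit loop, one common adjustment is to allocate the list once with vector("list", n) instead of growing it from list(). The sketch below is self-contained for illustration: it rebuilds valid_codons with expand.grid() (an assumed shortcut that should give the same 64 sorted codons as the hand-typed vector) and uses only two of the dummy genes.

```r
# Assumed shortcut: generate the 64 codons instead of typing them out
bases <- c("a", "c", "g", "t")
valid_codons <- sort(do.call(paste0,
  expand.grid(bases, bases, bases, stringsAsFactors = FALSE)))

# codon_count() as given in the question
codon_count <- function(gene) {
  answer <- rep(0, 64)
  names(answer) <- valid_codons
  for (i in seq(from = 1, to = length(gene), by = 3)) {
    codon <- tolower(paste0(gene[i], gene[i + 1], gene[i + 2]))
    answer[codon] <- answer[codon] + 1
  }
  return(answer[valid_codons])
}

genes_chars <- strsplit(c("gggtacaaagtgcat", "cggaaaaccggggcgtgtccg"), "")

# Preallocate the result list, then fill it in place
result <- vector("list", length(genes_chars))
for (i in seq_along(genes_chars)) {
  result[[i]] <- codon_count(genes_chars[[i]])
}

# Same answer as the lapply() version
stopifnot(identical(result, lapply(genes_chars, codon_count)))
```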
Say I have a data frame with a list of names and the companies they have as clients:
name <- c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe", "David", "David", "David", "David", "David")
company <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "F", "G", "H")
df1 <- data.frame(name, company)
Then I have a second data frame where I have companies who are working together on projects:
company1 <- c("A", "B", "C", "D", "E", "F", "G", "H")
company2 <- c("B", "C", "E", "E", "G", "A", "B", "C")
df2 <- data.frame(company1, company2)
My preferred outcome would be something like this:
name A B C D E F G No of sets
1 Anne 1 1 0 0 0 0 0 1
2 David 0 0 0 1 1 1 1 1
3 Joe 1 1 1 0 0 0 0 2
4 Mary 0 0 1 1 1 0 0 1
So this counts the number of "sets" that match the sets in df2. For example, Anne has 1s for A and B, which matches row 1 in df2. Joe has A, B and C, and both A-B and B-C are rows in df2, so Joe's row has two matches.
I think this might work for you; let me know. It doesn't match your expected result because you didn't include H, which I presume was a typo. Likewise, shouldn't Mary's No_of_sets also equal 2?
# Tabulate the frequency of name x company combinations
r <- as.data.frame.matrix(table(df1$name, df1$company))
r
#> A B C D E F G H
#> Anne 1 1 0 0 0 0 0 0
#> David 0 0 0 1 1 1 1 1
#> Joe 1 1 1 0 0 0 0 0
#> Mary 0 0 1 1 1 0 0 0
# Get "sets" of companies working together
s <- paste(df2$company1, df2$company2)
s
#> [1] "A B" "B C" "C E" "D E" "E G" "F A" "G B" "H C"
# Get all potential company sets associated with each name
m <- apply(r, MARGIN = 1, FUN = function(x) combn(names(which(x==1)), 2))
# Intersect sets of companies potentially working together (m) with
# companies actually working together (df2)
# (You could use a nested apply here, but I thought that it
# would be too opaque. Looping is a little more clear.)
for (name in rownames(r)) {
  pairs <- m[[name]]
  ppairs <- apply(pairs, 2, paste0, collapse = " ")
  r[which(rownames(r) == name), "No_of_sets"] <- length(intersect(ppairs, s))
}
r
#> A B C D E F G H No_of_sets
#> Anne 1 1 0 0 0 0 0 0 1
#> David 0 0 0 1 1 1 1 1 2
#> Joe 1 1 1 0 0 0 0 0 2
#> Mary 0 0 1 1 1 0 0 0 2
Created on 2021-10-19 by the reprex package (v2.0.1)
Edit: Let's say there's a chance that one name isn't working with more than one company. In that case, you'd need to add a conditional to account for this in both steps. First, new data... notice that the name "Solo" is only working with one company.
r
#> A B C D E F G H
#> Anne 1 1 0 0 0 0 0 0
#> David 0 0 0 1 1 1 1 1
#> Joe 1 1 1 0 0 0 0 0
#> Mary 0 0 1 1 1 0 0 0
#> Solo 1 0 0 0 0 0 0 0
m <- apply(r, MARGIN = 1, FUN = function(x)
  if (length(names(which(x == 1))) > 1) {
    combn(names(which(x == 1)), 2)
  } else names(which(x == 1))
)
m
#> $Anne
#> [,1]
#> [1,] "A"
#> [2,] "B"
#>
#> $David
#> [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
#> [1,] "D" "D" "D" "D" "E" "E" "E" "F" "F" "G"
#> [2,] "E" "F" "G" "H" "F" "G" "H" "G" "H" "H"
#>
#> $Joe
#> [,1] [,2] [,3]
#> [1,] "A" "A" "B"
#> [2,] "B" "C" "C"
#>
#> $Mary
#> [,1] [,2] [,3]
#> [1,] "C" "C" "D"
#> [2,] "D" "E" "E"
#>
#> $Solo
#> [1] "A"
for (name in rownames(r)) {
  pairs <- m[[name]]
  if (length(pairs) > 1) {
    ppairs <- apply(pairs, 2, paste0, collapse = " ")
  } else ppairs <- pairs
  r[which(rownames(r) == name), "No_of_sets"] <- length(intersect(ppairs, s))
}
r
#> A B C D E F G H No_of_sets
#> Anne 1 1 0 0 0 0 0 0 1
#> David 0 0 0 1 1 1 1 1 2
#> Joe 1 1 1 0 0 0 0 0 2
#> Mary 0 0 1 1 1 0 0 0 2
#> Solo 1 0 0 0 0 0 0 0 0
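One thing to be aware of: combn() emits each pair in column order (for example "A F"), while df2 may list the same pair the other way round ("F A"), so an exact string match would miss it. The sketch below is one possible way to make the comparison independent of order and to cover the one-company case in the same step. pair_key() and count_sets() are hypothetical helpers, not part of the answer above. With this particular data the counts come out the same.

```r
# Data from the question
name <- c("Anne", "Anne", "Mary", "Mary", "Mary", "Joe", "Joe", "Joe",
          "David", "David", "David", "David", "David")
company <- c("A", "B", "C", "D", "E", "A", "B", "C", "D", "E", "F", "G", "H")
df1 <- data.frame(name, company, stringsAsFactors = FALSE)
df2 <- data.frame(company1 = c("A", "B", "C", "D", "E", "F", "G", "H"),
                  company2 = c("B", "C", "E", "E", "G", "A", "B", "C"),
                  stringsAsFactors = FALSE)

# Hypothetical helper: order-independent key, so "F A" and "A F" coincide
pair_key <- function(a, b) paste(pmin(a, b), pmax(a, b))
s_key <- unique(pair_key(df2$company1, df2$company2))

# Hypothetical helper: count one person's company pairs that appear in df2
count_sets <- function(companies) {
  companies <- unique(companies)
  if (length(companies) < 2) return(0L)   # e.g. "Solo" with a single company
  p <- combn(companies, 2)                # 2-row matrix, one pair per column
  sum(pair_key(p[1, ], p[2, ]) %in% s_key)
}

no_of_sets <- vapply(split(df1$company, df1$name), count_sets, integer(1))
print(no_of_sets)
#>  Anne David   Joe  Mary 
#>     1     2     2     2
```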
I have a data frame in which every column represents an event and the cells contain the individuals, like this:
df=data.frame(topic1=c("a", "b","c", "d"), topic2=c("e","f", "g", "a"), topic3=c("b","c","g","h"))
I need to transform it into an adjacency data frame, like this:
topic1 topic2 topic3
a 1 1 0
b 1 0 1
c 1 0 1
d 1 0 0
e 0 1 0
f 0 1 0
g 0 1 1
h 0 0 1
THX!
Form levs containing the levels in sorted order and then for each column of df determine which levs are in it. This gives a logical matrix which we can convert to numeric using +.
levs <- sort(unique(unlist(df))) # a b c d e f g h
+ sapply(df, function(x) levs %in% x)
giving:
topic1 topic2 topic3
[1,] 1 1 0
[2,] 1 0 1
[3,] 1 0 1
[4,] 1 0 0
[5,] 0 1 0
[6,] 0 1 0
[7,] 0 1 1
[8,] 0 0 1
The last line could be written even more compactly as:
+ sapply(df, `%in%`, x = levs)
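The matrix above has no row names, whereas the requested output is labelled a to h. A minimal sketch of adding them follows; adj is just an illustrative variable name, and stringsAsFactors = FALSE is set explicitly so the example behaves the same on older R versions.

```r
df <- data.frame(topic1 = c("a", "b", "c", "d"),
                 topic2 = c("e", "f", "g", "a"),
                 topic3 = c("b", "c", "g", "h"),
                 stringsAsFactors = FALSE)

levs <- sort(unique(unlist(df)))             # a b c d e f g h
adj <- +sapply(df, function(x) levs %in% x)  # logical -> 0/1
rownames(adj) <- levs                        # label each row with its individual

print(adj["g", ])
#> topic1 topic2 topic3 
#>      0      1      1
```

If a data frame rather than a matrix is needed, as.data.frame(adj) keeps both sets of names.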
I have the following text; its contents are as follows:
#------------------
# CONTENTS OF TEXT
#------------------
H01, H04, G02, G06,
H01, H02, G02, H05,
G01, H04, H01
G09, G05
I want to convert this data into a binary matrix. I want the output to look like this:
H01 H02 H04 H05 G01 G02 G05 G06 G09
1 0 1 0 0 1 0 1 0
1 1 0 1 0 1 0 0 0
1 0 1 0 1 0 0 0 0
0 0 0 0 0 0 1 0 1
Please help
You can do:
d <- read.table(header=FALSE, sep='§', stringsAsFactors = FALSE, text=
'H01, H04, G02, G06,
H01, H02, G02, H05,
G01, H04, H01
G09, G05')
s <- sort(unique(unlist(strsplit(d$V1, ', *'))))
m <- sapply(s, grepl, x=d$V1, fixed=TRUE)
# > m
# G01 G02 G05 G06 G09 H01 H02 H04 H05
# [1,] FALSE TRUE FALSE TRUE FALSE TRUE FALSE TRUE FALSE
# [2,] FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE TRUE
# [3,] TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
# [4,] FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE
m[] <- as.integer(m)
# > m
# G01 G02 G05 G06 G09 H01 H02 H04 H05
# [1,] 0 1 0 1 0 1 0 1 0
# [2,] 0 1 0 0 0 1 1 0 1
# [3,] 1 0 0 0 0 1 0 1 0
# [4,] 0 0 1 0 1 0 0 0 0
Another idea, using @jogo's data:
library(dplyr)
library(tidyr)
d %>%
  mutate(V1 = stringi::stri_extract_all_words(V1), V2 = 1) %>%
  # note: the .id argument of unnest() is deprecated as of tidyr 1.0.0
  unnest(V1, .id = "id") %>%
  spread(V1, V2, fill = 0)
Which gives:
# id G01 G02 G05 G06 G09 H01 H02 H04 H05
#1 1 0 1 0 1 0 1 0 1 0
#2 2 0 1 0 0 0 1 1 0 1
#3 3 1 0 0 0 0 1 0 1 0
#4 4 0 0 1 0 1 0 0 0 0
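For comparison, here is a base-R sketch that tabulates exact codes per line with table() rather than substring matching. The variable names (txt, parts, row_id, m_bin) are illustrative only, and the input lines are typed in directly instead of being read with read.table().

```r
# The four lines of text from the question
txt <- c("H01, H04, G02, G06,",
         "H01, H02, G02, H05,",
         "G01, H04, H01",
         "G09, G05")

parts  <- strsplit(txt, ", *")                    # one character vector per line
codes  <- trimws(unlist(parts))                   # all codes, in order
row_id <- rep(seq_along(parts), lengths(parts))   # which line each code came from

# Cross-tabulate line x code, then reduce the counts to 0/1
m_bin <- (unclass(table(row = row_id, code = codes)) > 0) + 0L
print(m_bin)
```

Because table() compares whole strings, a code such as "H01" would not be confused with a longer code that merely contains it.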
I have a file like this.
"1" 10 2 0 0 0 0 0 0 0 0 0 0 0 4 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0
"2" 10 3 6 17 11 15 8 17 14 1 42 21 22 15 9 9 17 12 9 16 4 8 12 29 23 11 0 0 0 0
"3" 10 4 39 39 14 33 16 23 37 21 29 22 46 26 16 26 21 22 21 10 16 3 10 14 20 12 6 0 0 0
"4" 100 18 0 0 0 1 0 0 0 0 0 0 2 0 0 1 0 2 8 5 2 1 2 4 9 6 4 3 0 0
.....................
What I want to do is replace the values from column 4 onwards with characters, i.e. if a value is between 0 and 10, it is replaced by the character 'a'; if it is between 10 and 20, it is replaced by the character 'b'; and so on.
For example, the output file will be of the form,
"1" 10 2 0 0 0 0 0 0 0 0 0 0 0 a 0 0 a 0 0 0 0 0 a 0 0 0 0 0 0 0
.............................
How can I do this in R? Is there some way to automate the assignment of characters? Currently I am using two for loops and hardcoding the values by range.
Edit: My approach:
> for ( i in 1:nrow(x) )
+ for ( j in j:ncol(x) )
+ {
+ if (x[i,j] < 10 && x[i,j] > 0 )
+ x[i,j] = a
+ else if ( x[i,j] < 20 && x[i,j] > 10 )
+ x[i,j] = b
+ }
The above is my approach. It shows an error in the conditions, and I know it will take a lot of time since it uses two for loops.
One possible solution is to create a dummy data set to match against, and then match all non-zero values to it (assuming df is your data set):
matchData <- data.frame(lets = c(0, rep(letters, each = 10)),
nums = c(0, seq_len(length(letters)*10)))
df[, -seq_len(3)] <- sapply(df[, -seq_len(3)], function(x) matchData$lets[match(x, matchData$nums)])
df
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 1 10 2 0 0 0 0 0 0 0 0 0 0 0 a 0 0 a 0 0 0 0 0 a 0
# 2 2 10 3 a b b b a b b a e c c b a a b b a b a a b c
# 3 3 10 4 d d b d b c d c c c e c b c c c c a b a a b
# 4 4 100 18 0 0 0 a 0 0 0 0 0 0 a 0 0 a 0 a a a a a a a
# V26 V27 V28 V29 V30 V31
# 1 0 0 0 0 0 0
# 2 c b 0 0 0 0
# 3 b b a 0 0 0
# 4 a a a a 0 0
I think the following will be close; it is just a quick answer that hopefully helps you along. You'd have to apply this method over the rows or columns to cover the entire data frame. There is also coercion that I didn't handle here, so when testing on a single row everything got coerced to character.
The basic thought is that if you want 1-10 to correspond to "a" and 11-20 to correspond to "b", then we can get that by dividing the number by 10 and calling ceiling(). 1-10 then maps to 1, 11-20 maps to 2, and so forth. letters[1] is "a", letters[2] is "b", etc., so we get the desired functionality.
#everything coerced to char, I know
testVect<-c("2", 10, 3, 6, 17, 11, 15, 8 ,17, 14, 1, 42, 21, 22, 15, 9, 9, 17, 12, 9, 16, 4, 8, 12 ,29, 23, 11, 0, 0 ,0 ,0)
testAfter4<-sapply(testVect[4:length(testVect)],
function(entry) {
ifelse(entry==0, 0, letters[ceiling(as.numeric(entry)/10)])
} )
#need to cast entry back to numeric as it was coerced to char when initializing testVect
testVect[4:length(testVect)]<-testAfter4
testVect
#[1] "2" "10" "3" "a" "b" "b" "b" "a" "b" "b" "a" "e" "c" "c" "b"
#[16] "a" "a" "b" "b" "a" "b" "a" "a" "b" "c" "c" "b" "0" "0" "0"
#[31] "0"
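To extend the ceiling() idea to a whole data frame, one option is a small vectorised helper applied to every column from the fourth onwards. This is a sketch under assumptions: to_letter() is a hypothetical name, the values are taken to be non-negative integers no larger than 260, and the tiny df below only borrows a few values from the example rows.

```r
# Hypothetical helper: 0 stays "0", 1-10 -> "a", 11-20 -> "b", ...
to_letter <- function(x) {
  out <- rep("0", length(x))
  nz <- x != 0
  # Index letters only for the non-zero values: a zero index would be
  # silently dropped by letters[...] and shift the remaining results.
  out[nz] <- letters[ceiling(x[nz] / 10)]
  out
}

# Small stand-in for the real data (three id columns, then values)
df <- data.frame(V1 = c(1, 2), V2 = c(10, 10), V3 = c(2, 3),
                 V4 = c(0, 6), V5 = c(4, 17), V6 = c(1, 42))

# Convert column 4 onwards, leaving the first three columns untouched
df[-(1:3)] <- lapply(df[-(1:3)], to_letter)
print(df)
```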
You can index into letters with an offset based on your value divided by 10 (without remainder)...
mydat = c(10,2,0,19,20,19,0,0)
# Convert a number divided by 10 to its offset (hat tip to MrFlick for `letters`);
# this uses the cryptic-looking %/% operator for division without remainder
char10 = letters[1+(mydat %/% 10)]
# convert zeroes, and if desired replace columns 1:4 with the original data
char10[mydat==0] = 0
Output:
> char10
[1] "b" "a" "0" "b" "c" "b" "0" "0"
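Note that the ceiling() and %/% approaches put exact multiples of 10 in different bins, so it is worth deciding which boundary you want. The short comparison below illustrates this; the cut() line is an additional variant that is not taken from the answers above, shown only because it makes the interval boundaries explicit.

```r
x <- c(1, 9, 10, 11, 20)

by_ceiling <- letters[ceiling(x / 10)]   # 1-10 -> "a", 11-20 -> "b"
by_intdiv  <- letters[1 + x %/% 10]      # 0-9  -> "a", 10-19 -> "b"
# cut() with right-closed intervals (0,10], (10,20], ... matches ceiling()
by_cut <- as.character(cut(x, breaks = seq(0, 260, by = 10), labels = letters))

print(rbind(x, by_ceiling, by_intdiv, by_cut))
```

Here 10 and 20 come out as "a" and "b" with ceiling() or cut(), but as "b" and "c" with %/%.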
I have the following type of data, representing combinations of factors:
P1 <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d")
P2 <- c("a", "b", "c", "d", "b", "c", "d", "c", "d", "d")
myfactors <- data.frame(P1, P2)
P1 P2
1 a a
2 a b
3 a c
4 a d
5 b b
6 b c
7 b d
8 c c
9 c d
10 d d
In the real world there might be any number of factor levels, so I am trying to write a function that works for any number of levels. I want to set contrasts for all combinations available in the data set; in this data set these are a-b, a-c, a-d, b-c, b-d and c-d. The contrast rule is as follows.
For example, for "a-b": if P1 = P2 = a or P1 = P2 = b, the coefficient = -1,
if P1 = a, P2 = b or P1 = b, P2 = a, the coefficient = 2,
else the coefficient = 0.
The output coefficient matrix will look like the following:
P1 P2 a-b a-c a-d b-c b-d c-d
a a -1 -1 -1 0 0 0
a b 2 0 0 0 0 0
a c 0 2 0 0 0 0
a d 0 0 2 0 0 0
b b 1 0 0 -1 -1 0
b c 0 0 0 2 0 0
b d 0 0 0 0 2 0
c c 0 1 0 0 0 -1
c d 0 0 0 -1 0 2
d d 0 0 -1 0 -1 -1
Since the function I have in mind should be flexible, if I apply it to the following dataset,
P1 <- c("CI", "CI", "CI", "CD", "CD", "CK", "CK")
P2 <- c("CI", "CD", "CK", "CD", "CK", "CK", "CI")
mydf2 <- data.frame(P1, P2)
mydf2
P1 P2
1 CI CI
2 CI CD
3 CI CK
4 CD CD
5 CD CK
6 CK CK
7 CK CI
The expected coefficient matrix for this dataframe is:
P1 P2 CI-CD CI-CK CD-CK CK-CI
CI CI -1 -1 0 -1
CI CD 2 0 0 0
CI CK 0 2 0 0
CD CD -1 0 -1 0
CD CK 0 0 2 0
CK CK 0 -1 -1 -1
CK CI 0 0 0 2
I tried several ways but could not come up with a working program.
EDITS:
(1) I am not testing all possible combinations; only the combinations that appear in P1 and P2 are tested.
(2) I intend to develop a solution not only for this instance but for general application, for example the myfactors data frame above.
You didn't supply a reason for your particular choice of the 6 ordered combinations of P1 and P2 values, so I just ran through them all:
combos <- cbind( combn(unique(c(P2, P1)), 2), combn(unique(c(P2, P1)), 2)[2:1, ])
combos
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] "CI" "CI" "CD" "CD" "CK" "CK"
[2,] "CD" "CK" "CK" "CI" "CI" "CD"
As I went through the logic it seemed more compact to test for conditions 1) and 2) and just use Boolean math to return the results. If both conditions are false you get 0. I've checked the entries that do not match yours, and I think your construction was wrong in spots. You have 0 in the "CI-CK" column of row 7, and I think the answer by your rules should be 2:
sapply(1:ncol(combos), function(x) with( mydf2,
2*( (P1==combos[1,x] & P2 == combos[2,x]) | (P2==combos[1,x] & P1 == combos[2,x])) -
(P1 == P2 & P1 %in% combos[,x]) ) )
#---------------
[,1] [,2] [,3] [,4] [,5] [,6]
[1,] -1 -1 0 -1 -1 0
[2,] 2 0 0 2 0 0
[3,] 0 2 0 0 2 0
[4,] -1 0 -1 -1 0 -1
[5,] 0 0 2 0 0 2
[6,] 0 -1 -1 0 -1 -1
[7,] 0 2 0 0 2 0
#------------------
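mydf2[ , 3:8] <- sapply(1:ncol(combos), function(x) with( mydf2,
  2*( (P1==combos[1,x] & P2 == combos[2,x]) | (P2==combos[1,x] & P1 == combos[2,x])) -
  (P1 == P2 & P1 %in% combos[,x]) ) )
# Label the new columns with the pair each one tests, e.g. "CI-CD"
names(mydf2)[3:8] <- paste(combos[1, ], combos[2, ], sep = "-")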
mydf2
#-----------------
P1 P2 CI-CD CI-CK CD-CK CD-CI CK-CI CK-CD
1 CI CI -1 -1 0 -1 -1 0
2 CI CD 2 0 0 2 0 0
3 CI CK 0 2 0 0 2 0
4 CD CD -1 0 -1 -1 0 -1
5 CD CK 0 0 2 0 0 2
6 CK CK 0 -1 -1 0 -1 -1
7 CK CI 0 2 0 0 2 0
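Since the question's edits ask for a general function that only builds contrasts for the pairs actually present in the data, here is one possible sketch along the same Boolean-math lines. contrast_matrix() is a hypothetical helper, not part of the answer above. It creates one column per observed (P1, P2) row with P1 != P2, in order of first appearance, so a pair observed in both orders (such as CI-CK and CK-CI) yields two identical columns. It also assumes the data frame has at least two rows and at least one row with P1 != P2.

```r
# Hypothetical helper: contrasts for the pairs that occur in the data
contrast_matrix <- function(df) {
  p1 <- as.character(df$P1)
  p2 <- as.character(df$P2)
  # Observed ordered pairs with two different levels
  obs <- unique(cbind(p1, p2)[p1 != p2, , drop = FALSE])
  coefs <- vapply(seq_len(nrow(obs)), function(k) {
    x <- obs[k, 1]
    y <- obs[k, 2]
    cross <- (p1 == x & p2 == y) | (p1 == y & p2 == x)   # coefficient 2
    same  <- p1 == p2 & p1 %in% c(x, y)                  # coefficient -1
    2 * cross - same
  }, numeric(length(p1)))
  colnames(coefs) <- paste(obs[, 1], obs[, 2], sep = "-")
  data.frame(df, coefs, check.names = FALSE)
}

# First example from the question
P1 <- c("a", "a", "a", "a", "b", "b", "b", "c", "c", "d")
P2 <- c("a", "b", "c", "d", "b", "c", "d", "c", "d", "d")
myfactors <- data.frame(P1, P2, stringsAsFactors = FALSE)
res <- contrast_matrix(myfactors)
print(res)
```

Applied to mydf2, the same function should produce the columns CI-CD, CI-CK, CD-CK and CK-CI.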