Split one column into two - r

I have a large data.table of genotypes (260,000 rows by 1000 columns). The rows are markers and the columns are the subjects. The data looks like this:
ID1 ID2 ID3 ID4
M1: CC CC TC CC
M2: GG GG GG GG
M3: TT TT TT TT
M4: TG TG TG TG
M5: TT TT TT TT
M6: TT TT TT TT
I need to split each genotype so that I have each allele in its own column like this:
V1 V2 V3 V4 V5 V6 V7 V8
M1: C C C C T C C C
M2: G G G G G G G G
M3: T T T T T T T T
M4: T G T G T G T G
M5: T T T T T T T T
M6: T T T T T T T T
I have come up with two solutions. Both work on a subset of the data but break down on the entire data set, either because of memory issues or because of an internal data.table error that I don't understand.
I used strsplit on each column and stored the results in a list, then used do.call to merge them all. I also parallelized it using the foreach function:
ids <- colnames(DT)
gene.split <- function(i) {
as.data.table(do.call(rbind,strsplit(as.vector(eval(parse(text=paste("DT$",ids[i])))), split = "")))
}
all.gene <- foreach(i=1:length(ids)) %dopar% gene.split(i)
do.call(cbind,all.gene)
On 4 cores this breaks down due to memory issues.
The second solution is based on a similar problem which uses the set function:
out_names <- paste("V", 1:(2*ncol(DT)), sep="_")
invar1 <- names(DT)
for (i in seq_along(invar1)) {
set(DT, i=NULL, j=out_names[2*i-1], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,1])
set(DT, i=NULL, j=out_names[2*i], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,2])
}
which works on a few columns but then I get the following error if I try using the entire dataset:
Error in set(DT, i = NULL, j = out_names[2 * i - 1], value = do.call(rbind, :
Internal logical error. DT passed to assign has not been allocated enough column slots. l=163, tl=163, adding 1
Am I going about this the wrong way?

Here is an approach using data.table::set and substr (not strsplit)
Using @jbaums' example data l
# coerce to `data.table` without a copy
setDT(l)
# over allocate columns so that `data.table` can assign by reference
# this will stop the error you were seeing
alloc.col(l,3000)
out_names <- paste("V", 1:(2*ncol(l)), sep="_")
invar1 <- names(l)
for (i in seq_along(invar1)) {
set(l, i=NULL, j=out_names[2*i-1], value=substr(l[[invar1[i]]],1,1))
set(l, i=NULL, j=out_names[2*i], value=substr(l[[invar1[i]]],2,2))
}
The final step took 37 seconds on my Windows 7 i7 2600 machine with 8GB ram
In your example you run strsplit twice per column (and then do.call(rbind, ...) on each result), which is not efficient.
Some benchmarking of possible approaches to the splitting:
microbenchmark(substr(l[[invar1[1L]]],2,2), sapply(strsplit(l[[invar1[1L]]],''),`[`,2L),do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[,2], times=5)
Unit: milliseconds
                                                      expr        min         lq     median         uq       max neval
                             substr(l[[invar1[1L]]], 2, 2)   14.10669   14.35571   14.57485   15.78283  193.9125     5
            sapply(strsplit(l[[invar1[1L]]], ""), `[`, 2L)  345.92969 1420.03907 1944.33873 3864.82876 5371.6130     5
 do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[, 2] 3318.70878 4131.38551 4155.06126 5269.92745 8414.4948     5
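The same substr idea can be sketched in base R on the six-marker example from the question. This is only an illustration under assumptions: the object names `geno` and `alleles` are invented here, and a preallocated character matrix stands in for the data.table that the answer updates by reference.

```r
# Toy genotype table from the question (hypothetical object name `geno`)
geno <- data.frame(
  ID1 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  ID2 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  ID3 = c("TC", "GG", "TT", "TG", "TT", "TT"),
  ID4 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  stringsAsFactors = FALSE
)

# Preallocate the result once: two allele columns per subject
alleles <- matrix(NA_character_,
                  nrow = nrow(geno), ncol = 2 * ncol(geno),
                  dimnames = list(paste0("M", seq_len(nrow(geno))),
                                  paste0("V", seq_len(2 * ncol(geno)))))

# substr is vectorised over the whole column, so no strsplit is needed
for (i in seq_along(geno)) {
  alleles[, 2 * i - 1] <- substr(geno[[i]], 1, 1)  # first allele
  alleles[, 2 * i]     <- substr(geno[[i]], 2, 2)  # second allele
}
alleles
```

Filling a preallocated container column by column avoids building one small vector per row, which is where the strsplit plus rbind variants spend most of their time.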

Here's a relatively fast approach - took ~80 sec (after dummy data creation) (Win 8.1 x64; i4770) but chewed up ~13 GB of RAM.
# Creating initial data
pairs <- c(outer(c('C', 'T', 'G', 'A'), c('C', 'T', 'G', 'A'), 'paste0'))
l <- replicate(1000, sample(pairs, 260000, replace=TRUE), simplify=FALSE)
system.time({
v <- do.call(paste0, l)
rm(l); gc()
out <- do.call(rbind, strsplit(v, ''))
rm(v); gc()
})
# user system elapsed
# 79.07 1.24 80.33
str(out)
# chr [1:260000, 1:2000] "A" "C" "C" "C" ...
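On a two-marker toy input the idea is easier to follow: pasting the subject columns together gives one string per marker, and splitting that string yields one allele per cell. The list `cols` below is invented for illustration and is not part of the benchmark above.

```r
# Two subjects, two markers (made-up values)
cols <- list(ID1 = c("CC", "GG"), ID2 = c("TC", "GG"))

# One concatenated string per marker (row)
row_strings <- do.call(paste0, unname(cols))

# Split each string into single characters and stack the rows
out <- do.call(rbind, strsplit(row_strings, ""))
out
```

Only one strsplit call is made for the whole table, at the cost of holding the concatenated strings and the split list in memory at the same time.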

Here's a way to do this for a data frame x:
do.call(cbind,
lapply(x,
function(i) do.call(rbind, strsplit(as.character(i), split=''))
)
)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "C" "C" "C" "C" "T" "C" "C" "C"
[2,] "G" "G" "G" "G" "G" "G" "G" "G"
[3,] "T" "T" "T" "T" "T" "T" "T" "T"
[4,] "T" "G" "T" "G" "T" "G" "T" "G"
[5,] "T" "T" "T" "T" "T" "T" "T" "T"
[6,] "T" "T" "T" "T" "T" "T" "T" "T"
Each column is split into characters, and the pieces are row-bound into a two-column matrix. lapply returns a list of these matrices, which is then passed to cbind.

## make a small data.table for testing
dd <- data.table(ID1=c("CC","TG"),ID2=c("CC","TG"), ID3=c("TC","TG"))
dd
## ID1 ID2 ID3
## 1: CC CC TC
## 2: TG TG TG
## the first base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][1])
## ID1 ID2 ID3
## [1,] "C" "C" "T"
## [2,] "T" "T" "T"
## the second base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][2])
## ID1 ID2 ID3
## [1,] "C" "C" "C"
## [2,] "G" "G" "G"
## These results are in matrix, if you need data.table use as.data.table to convert them back.


Compact Letter Display from a matrix of significancies or by hand

I am running a multiple pairwise comparison in R. I'm using the survival package survminer. I'm using the function:
pairwise_survdiff {survminer}
It gives the pairwise comparisons with significance as expected, but doesn't seem to have a way to give a compact letter display (CLD) of the results. I'm looking at pairs of 19 levels. I ended up printing the results, putting them into excel by hand and then doing letters by hand. But now I need to do it again and am hoping for an easier way.
1) Can I have R do a CLD from the pairwise_survdiff {survminer} results directly?
Barring that:
2) Is there a way to get it to print results into a table that can be read by a spreadsheet?
3) If I make the logic matrix by hand, how do I have R take that and turn it into a CLD?
4) If I'm doing it all by hand, is there a more compact method of showing this list of comparisons? Can I eliminate any of these letters due to redundancy?
[Image: hand-made CLD for comparisons]
Thank you
Here's the example from survminer
library(survminer)
library(multcomp)
library(tidyr)
data(myeloma)
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
data = myeloma)
Looking at the internals of the glht.summary method from the multcomp package, we create the lvl_order vector which identifies the ordering of the levels of x from smallest to largest.
x <- myeloma$molecular_group
levs <- levels(x)
y <- Surv(myeloma$time, myeloma$event)
lvl_order <- levels(x)[order(tapply(as.numeric(y)[1:length(x)],
x, mean))]
Then we can re-arrange the p-values from the res object into a matrix. mycomps is a matrix of the two sides of the paired comparisons. The signif vector is logical indicating whether differences are significant or not.
comps <- as_tibble(res$p.value, rownames="row") %>%
pivot_longer(-row, names_to="col", values_to="p") %>%
na.omit()
mycomps <- as.matrix(comps[,1:2])
signif <- comps$p < .05
Then, you can use the insert_absorb internal function to create the letters:
multcomp:::insert_absorb(signif,
decreasing=FALSE,
comps=mycomps,
lvl_order=lvl_order)
# $Letters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a" "b" "ab" "b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $monospacedLetters
# MAF Proliferation Cyclin D-2 MMSET Hyperdiploid
# "ab" "a " " b" "ab" " b"
# Low bone disease Cyclin D-1
# "ab" "ab"
#
# $LetterMatrix
# a b
# MAF TRUE TRUE
# Proliferation TRUE FALSE
# Cyclin D-2 FALSE TRUE
# MMSET TRUE TRUE
# Hyperdiploid FALSE TRUE
# Low bone disease TRUE TRUE
# Cyclin D-1 TRUE TRUE
#
# $aLetters
# [1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v"
# [23] "w" "x" "y" "z" "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
# [45] "S" "T" "U" "V" "W" "X" "Y" "Z"
#
# $aseparator
# [1] "."
#
# attr(,"class")
# [1] "multcompLetters"
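For the spreadsheet part of the question, one possible base-R route is to flatten the p-value matrix into one row per comparison and write it as CSV. The matrix `pmat` below is invented and only stands in for res$p.value, on the assumption that the real object is a lower-triangular matrix with NA in the unused cells.

```r
# Invented p-values standing in for res$p.value (lower triangle, NA elsewhere)
pmat <- matrix(c(0.01, 0.20, 0.03,
                 NA,   0.04, 0.50,
                 NA,   NA,   0.60),
               nrow = 3,
               dimnames = list(c("B", "C", "D"), c("A", "B", "C")))

# Long format: one row per comparison, dropping the empty cells
long <- as.data.frame(as.table(pmat), stringsAsFactors = FALSE)
names(long) <- c("group1", "group2", "p")
long <- long[!is.na(long$p), ]
long$significant <- long$p < 0.05

# Write to a temporary file; replace with a real path as needed
out_file <- tempfile(fileext = ".csv")
write.csv(long, out_file, row.names = FALSE)
long
```

The resulting file opens directly in a spreadsheet, and the `significant` column is the same kind of logical vector that the letter-assignment step takes as input.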

extracting data from two columns

I have a data frame that looks like this.
It refers to words and their structure
df <- data.frame(word = c("pokkoitta", "demna", "ningatinggo ", "tengkeam", "bampana", "njam"), structure = c("CvC:vvC:v", "CvCCv", "CvCvCvNCv", "CvNCvvC", "CvNCvCv" , "NCvC"))
The second column indicates the structure of the first column. If a C:, NC or CC combination occurs in the second column, I need to extract the letters in the first column that it refers to.
So I would need:
kk C:
kk C:
mn CC
ngg NC
ngk NC
mp NC
nj NC
One thing that needs to be taken into account is that a simple character count does not work: some two-letter sequences in the left column correspond to one letter in the right column, namely ng|sy|kh = C (not CC, as they represent one phoneme).
Also, in one word, more than one of these combinations can occur.
Thx
Update:
This would be the matching pattern with regex:
(nj|ngk|ngg|nc|nt|nd|mp|mb) = NC
(ng|sy|kh) = C
(([b-df-hj-np-tv-xz])\\1+) = C:
([b-df-hj-np-tv-z]) = C
(') = :
((a|e|i|o|u)\\1+) = v:
(a|e|i|o|u) = v
Interesting problem. I might have just re-invented the algorithm used to find those structures, but it seems to work.
df <- data.frame(
word=c("pokkoitta", "demna", "ningatinggo", "tengkeam", "bampana", "njam"),
structure=c("CvC:vvC:v", "CvCCv", "CvCvCvNCv", "CvNCvvC", "CvNCvCv", "NCvC"),
stringsAsFactors=FALSE)
pat <- data.frame(str=c("NC", "C", "C:", "C", "v:", "v"),
rex=c("nj|ngk|ngg|nc|nt|nd|mp|mb",
"ng|sy|kh",
"([b-df-hj-np-tv-xz])\\1+",
"[b-df-hj-np-tv-z]",
"(a|e|i|o|u)\\1+",
"a|e|i|o|u"), stringsAsFactors=FALSE)
xs <- xw <- df[,1]
for (i in 1:nrow(pat)) {
rx <- gregexpr(pat[i, 2], xs)
mc <- regmatches(xs, rx)
mp <- sapply(mc, function(x) format(paste("", x), width=6))
mc[lengths(mc) != 0] <- mp[lengths(mc) != 0]
regmatches(xw, rx) <- mc
regmatches(xs, rx) <- paste("", format(pat[i, 1], width=5))
}
phon <- trimws(cbind(word=xw, structure=xs))
phon <- apply(phon, 1, strsplit, " +")
phon <- lapply(phon, function(x) do.call(cbind, x))
head(phon, 3)
# [[1]]
# word structure
# [1,] "p" "C"
# [2,] "o" "v"
# [3,] "kk" "C:"
# [4,] "o" "v"
# [5,] "i" "v"
# [6,] "tt" "C:"
# [7,] "a" "v"
#
# [[2]]
# word structure
# [1,] "d" "C"
# [2,] "e" "v"
# [3,] "m" "C"
# [4,] "n" "C"
# [5,] "a" "v"
#
# [[3]]
# word structure
# [1,] "n" "C"
# [2,] "i" "v"
# [3,] "ng" "C"
# [4,] "a" "v"
# [5,] "t" "C"
# [6,] "i" "v"
# [7,] "ngg" "NC"
# [8,] "o" "v"
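The doubled-consonant (C:) pattern is the least obvious entry in the pattern table, so here is a small isolated sketch of it. The variable names `words`, `geminate` and `hits` are chosen for illustration; the regex itself is the one used in the answer.

```r
words <- c("pokkoitta", "demna", "njam")

# A consonant captured in group 1, followed by one or more repeats of itself
geminate <- "([b-df-hj-np-tv-xz])\\1+"

# gregexpr finds every match per word; regmatches extracts the matched text
hits <- regmatches(words, gregexpr(geminate, words))
hits
```

Words without a doubled consonant come back as empty character vectors, which is why the loop in the answer only overwrites entries where lengths(mc) is non-zero.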

Determining position in a list (variable) of a case in R

How to determine numerical position of a case in a vector?
I have a variable e.g., Var with multiple cases:
Case_1 <- 22
Case_2 <- 33
Case_3 <- 155
Case_4 <- 321
Var <- cbind(c(Case_1,Case_2,Case_3, Case_4))
names(Var) <- c("Case_1","Case_2","Case_3", "Case_4")
Var <- sort(Var, decreasing = TRUE)
I want to know the position of a case in this list (in this instance, Case_4 is position 1, Case_3 is position 2, etc.). How can I do it in R?
Here are 2 ways:
Case_1 <- 22
Case_2 <- 33
Case_3 <- 155
Case_4 <- 321
Var <- cbind(c(Case_1,Case_2,Case_3, Case_4))
names(Var) <- c("Case_1","Case_2","Case_3", "Case_4")
Var <- sort(Var, decreasing = TRUE)
# By value
n <- which(Var == 321)
cat("The position with a value of 321 is", n)
# The position with a value of 321 is 1
# By name
n <- which(names(Var) == "Case_1")
cat("The position of Case_1 is", n)
# The position of Case_1 is 4
You can use the match() function:
myvec <- letters
> myvec
[1] "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k" "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z"
> match('e',myvec)
[1] 5
And you get the index this way.
As for your example:
Var <- read.table(header=F, text= "Case_1 22
Case_2 33
Case_3 155
Case_4 321")
Var <- Var[order(Var$V2, decreasing=T), ] # sort by value (column V2), decreasing
> match('Case_4',Var$V1)
[1] 1
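Both ideas also work on a plain named vector, which may be the simplest representation of the data in the question. The sketch below assumes the cases are stored as a named numeric vector rather than via cbind; the lookup vector `pos` is an invented helper.

```r
# Named vector holding the four cases, sorted from largest to smallest
Var <- c(Case_1 = 22, Case_2 = 33, Case_3 = 155, Case_4 = 321)
Var <- sort(Var, decreasing = TRUE)

# Position of a single case, looked up by name
match("Case_4", names(Var))

# match is vectorised, so several cases can be looked up at once
match(c("Case_3", "Case_1"), names(Var))

# A lookup vector giving the position of every case
pos <- setNames(seq_along(Var), names(Var))
pos
```

match returns the first position only and gives NA for a name that is absent, whereas which returns every matching position and an empty result when there is none.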

Analyze table in R to count nucleotide frequencies

I am quite new to R and I have a table of strings, I believe, that I extracted from a text file that contains a list of nucleotides (ex. "AGCTGTCATGCT.....").
Here are the first two rows of the text file to help as an example:
AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGCTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC
I need to count every "A" in the sequence by incrementing its variable, a. The same applies for G, C, and T (variables to increment are g, c ,t respectively).
At the end of the "for" loop I want the number of times the "A", "G", "C" and "T" nucleotides occurred, so I can calculate the dinucleotide frequencies and, hopefully, the transition matrix. My code is below. It doesn't work: it just returns each variable equal to 0, which is wrong. Please help, thanks!
#I saved the newest version to a text file of the nucleotides
dnaseq <- read.table("/My path file/ecoli.txt")
g=0
c=0
a=0
t=0
for(i in dnaseq[[1]]){
if(i=="A") (inc(a)<-1)
if(i=="G") (inc(g)<-1)
if(i=="C") (inc(c)<-1)
if(i=="T") (inc(t)<-1)
}
a
g
c
t
The simplest way to get the counts of each nucleotide (or any kind of letter) is to use the table and strsplit functions. For example:
myseq = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"
# split it into a vector of individual characters:
strsplit(myseq, "")[[1]]
# [1] "A" "G" "C" "T" "T" "T" "T" "C" "A" "T" "T" "C" "T" "G" "A" "C" "T" "G" "C" "A" "A" "C" "G" "G" "G" "C" "A" "A" "T" "A" "T" "G" "T" "C" "T" "C" "T" "G" "T"
# [40] "G" "T" "G" "G" "A" "T" "T" "A" "A" "A" "A" "A" "A" "A" "G" "A" "G" "T" "G" "T" "C" "T" "G" "A" "T" "A" "G" "C" "A" "G" "C"
# count the frequencies of each
table(strsplit(myseq, "")[[1]])
# A C G T
# 20 12 17 21
Now, if you don't care about the difference between one line and the next (if this is just one long sequence in ecoli.txt) then you want to combine the file into one long string first:
table(strsplit(paste(dnaseq[[1]], collapse = ""), "")[[1]])
That's the one line solution, but it might be clearer to see it in three lines:
combined.seq = paste(dnaseq[[1]], collapse = "")
combined.seq.vector = strsplit(combined.seq, "")
frequencies = table(combined.seq.vector)
If you're wondering what was wrong with your original code: first, I don't know where the inc function comes from (or why it wasn't throwing an error: are you sure dnaseq[[1]] has length greater than 0?). In any case, you weren't iterating over the sequence, you were iterating over the lines. i was never going to be a single character like A or T; it was always going to be a full line.
In any case, the solution with collapse, table and strsplit is both more concise and computationally efficient than a for loop (or a pair of nested for loops, which is what you would need).
You may use the following code which calls the str_count function (that counts the number of occurrences of a fixed text pattern) from the stringr package. It should work faster than the other solution which splits the character string into one-letter substrings.
require('stringr') # call install.packages('stringr') to download the package first
# read the text file (each text line will be a separate string):
dnaseq <- readLines("path_to_file.txt")
# merge text lines into one string:
dnaseq <- str_c(dnaseq, collapse="")
# count the number of occurrences of each nucleotide:
sapply(c("A", "G", "C", "T"), function(nuc)
str_count(dnaseq, fixed(nuc)))
Note that this solution may easily be extended to finding subsequences of length > 1 (just change the search pattern in sapply(), e.g. to as.character(outer(c("A", "G", "C", "T"), c("A", "G", "C", "T"), str_c)), which generates all pairs of nucleotides).
However, note that detecting AGA in AGAGA will report only 1 occurrence as str_count() does not take overlapping patterns into account.
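If overlapping occurrences do matter, one possible base-R workaround is a zero-width lookahead, which lets the search restart at the very next character. The helper name `count_overlapping` is invented here, and the sketch assumes the pattern consists of plain letters with no regex metacharacters.

```r
# Count occurrences of `pattern` in `text`, allowing overlaps
count_overlapping <- function(pattern, text) {
  # (?=...) matches a position without consuming characters (needs perl = TRUE)
  hits <- gregexpr(paste0("(?=", pattern, ")"), text, perl = TRUE)[[1]]
  sum(hits > 0)  # gregexpr returns -1 when there is no match at all
}

count_overlapping("AGA", "AGAGA")
count_overlapping("TTT", "AGAGA")
```

The first call counts the occurrences starting at positions 1 and 3, and the second shows that the no-match case is reported as zero rather than one.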
I am assuming that your nucleotide sequence is in a character vector of length one. If you are looking for the dinucleotide frequencies and a transition matrix, here is one solution:
dnaseq <- "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAG
CTTCTGAACTGGTTACCTGCCGTGAGTAAATTAAAATTTTATTGACTTAGGTCACTAAATACTTTAAC"
## list of nucleotides
nuc <- c("A","T","G","C")
## all distinct dinucleotides
nuc_comb <- expand.grid(nuc,nuc)
nuc_comb$two <- paste(nuc_comb$Var1, nuc_comb$Var2, sep = "")
# Var1 Var2 two
# 1 A A AA
# 2 T A TA
# 3 G A GA
# 4 C A CA
# 5 A T AT
# 6 T T TT
# 7 G T GT
# 8 C T CT
# 9 A G AG
# 10 T G TG
# 11 G G GG
# 12 C G CG
# 13 A C AC
# 14 T C TC
# 15 G C GC
# 16 C C CC
## Using `vapply` and regular expressions to count dinucleotide sequences:
nuc_comb$freq <- vapply(nuc_comb$two,
function(x) length(gregexpr(x, dnaseq, fixed = TRUE)[[1]]),
integer(1))
# AA TA GA CA AT TT GT CT AG TG GG CG AC TC GC CC
# 11 11 7 5 9 12 9 13 7 13 4 2 8 7 5 2
## label and reshape to matrix/table
dinuc_df <- reshape(nuc_comb, direction = "wide",
idvar = "Var1", timevar = "Var2", drop = "two")
dinuc_mat <- as.matrix(dinuc_df[-1])
rownames(dinuc_mat) <- colnames(dinuc_mat) <- nuc
# A T G C
# A 11 9 7 8
# T 11 12 13 7
# G 7 9 4 5
# C 5 13 2 2
## get margin proportions for transition matrix
## (probability of moving from nucleotide in row to nucleotide in column)
dinuc_tab <- prop.table(dinuc_mat, 1)
# A T G C
# A 0.3142857 0.2571429 0.20000000 0.22857143
# T 0.2558140 0.2790698 0.30232558 0.16279070
# G 0.2800000 0.3600000 0.16000000 0.20000000
# C 0.2272727 0.5909091 0.09090909 0.09090909
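One caveat with counting through gregexpr is that matches do not overlap, so runs such as AAAA are undercounted, and a pair that never occurs is still reported with length 1. An alternative sketch that counts every adjacent pair exactly once is shown below. The short sequence and the names `first`, `second`, `counts` and `trans` are illustrative only, and the sketch assumes each of the four nucleotides occurs at least once in a non-final position, so that no row sums to zero.

```r
dnaseq <- "AGCTTTTCATTC"   # short made-up sequence
nuc <- c("A", "T", "G", "C")
n <- nchar(dnaseq)

# Character i and its successor i + 1, for every position except the last
first  <- substring(dnaseq, 1:(n - 1), 1:(n - 1))
second <- substring(dnaseq, 2:n, 2:n)

# Cross-tabulate; fixed factor levels keep pairs that never occur as zeros
counts <- table(from = factor(first, levels = nuc),
                to   = factor(second, levels = nuc))

# Row-wise proportions: probability of moving from row to column
trans <- prop.table(counts, 1)
counts
trans
```

A sequence of length n always yields n - 1 pairs with this approach, which gives a quick sanity check on the counts.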

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice. Therefore, every shuffled vector returned should also contain "a" twice.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or, probably a better approach to removing duplicates:
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat),]
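Instead of generating all 120 orderings and discarding the repeats, the distinct shuffles can also be built directly. The recursive helper below is a base-R sketch written for this illustration (the name `unique_perms` is invented and it is not part of combinat): at each step it picks each distinct value once as the next element and recurses on what is left.

```r
# All distinct orderings of a vector that may contain repeated values
unique_perms <- function(x) {
  if (length(x) <= 1) return(list(x))
  out <- list()
  for (v in unique(x)) {
    rest <- x[-match(v, x)]  # remove a single copy of v
    for (p in unique_perms(rest)) {
      out[[length(out) + 1]] <- c(v, p)
    }
  }
  out
}

my_vec <- c("a", "b", "a", "c", "d")
res <- unique_perms(my_vec)
length(res)  # 5! / 2! distinct orderings
```

For long vectors the number of orderings grows factorially either way, so this mainly saves the memory and time spent on duplicates.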
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
if(n==1){
return(matrix(1))
} else {
sp <- permutations(n-1)
p <- nrow(sp)
A <- matrix(nrow=n*p,ncol=n)
for(i in 1:n){
A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
}
return(A)
}
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)
x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
