extracting data from two columns - r

I have a data frame that looks like this.
It refers to words and their structure
df <- data.frame(word = c("pokkoitta", "demna", "ningatinggo ", "tengkeam", "bampana", "njam"), structure = c("CvC:vvC:v", "CvCCv", "CvCvCvNCv", "CvNCvvC", "CvNCvCv" , "NCvC"))
The second column indicates the structure of the first column. If a C:, NC or CC combination occurs in the second column, I need to extract the letters in the first column that it refers to.
So I would need:
kk C:
tt C:
mn CC
ngg NC
ngk NC
mp NC
nj NC
One thing that needs to be taken into account is that a simple count does not work, because some 2-letter sequences in the left column correspond to 1 letter in the right column, namely ng|sy|kh = C (not CC, as they represent one phoneme).
Also, more than one of these combinations can occur in one word.
Thx
Update:
This would be the matching pattern with regex:
(nj|ngk|ngg|nc|nt|nd|mp|mb) = NC
(ng|sy|kh) = C
(([b-df-hj-np-tv-xz])\\1+) = C:
([b-df-hj-np-tv-z]) = C
(') = :
((a|e|i|o|u)\\1+) = v:
(a|e|i|o|u) = v
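One way to read this table is as a single ordered alternation, with the longer alternatives listed first so that ngk is tried before ng. The sketch below only illustrates that reading; the function name tokenize is made up here, and perl = TRUE is assumed so that alternatives are tried strictly left to right.

```r
# Hypothetical helper: split a word into phoneme-sized pieces in one pass.
# Alternatives are ordered as in the table: NC clusters, digraphs,
# doubled consonants, single consonants, doubled vowels, single vowels.
tokenize <- function(w) {
  pat <- paste(
    "nj|ngk|ngg|nc|nt|nd|mp|mb",      # NC
    "ng|sy|kh",                       # C (digraphs)
    "([b-df-hj-np-tv-xz])\\1+",       # C: (doubled consonant)
    "[b-df-hj-np-tv-z]",              # C
    "([aeiou])\\2+",                  # v: (doubled vowel)
    "[aeiou]",                        # v
    sep = "|"
  )
  regmatches(w, gregexpr(pat, w, perl = TRUE))[[1]]
}

tokenize("tengkeam")
# [1] "t"   "e"   "ngk" "e"   "a"   "m"
tokenize("pokkoitta")
# [1] "p"  "o"  "kk" "o"  "i"  "tt" "a"
```

Because ngk is listed before ng, the k is not split off as a separate consonant.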

Interesting problem. I might have just re-invented the algorithm used to find those structures, but it seems to work.
df <- data.frame(
word=c("pokkoitta", "demna", "ningatinggo", "tengkeam", "bampana", "njam"),
structure=c("CvC:vvC:v", "CvCCv", "CvCvCvNCv", "CvNCvvC", "CvNCvCv", "NCvC"),
stringsAsFactors=FALSE)
pat <- data.frame(str=c("NC", "C", "C:", "C", "v:", "v"),
rex=c("nj|ngk|ngg|nc|nt|nd|mp|mb",
"ng|sy|kh",
"([b-df-hj-np-tv-xz])\\1+",
"[b-df-hj-np-tv-z]",
"(a|e|i|o|u)\\1+",
"a|e|i|o|u"), stringsAsFactors=FALSE)
xs <- xw <- df[,1]
for (i in 1:nrow(pat)) {
rx <- gregexpr(pat[i, 2], xs)
mc <- regmatches(xs, rx)
mp <- sapply(mc, function(x) format(paste("", x), width=6))
mc[lengths(mc) != 0] <- mp[lengths(mc) != 0]
regmatches(xw, rx) <- mc
regmatches(xs, rx) <- paste("", format(pat[i, 1], width=5))
}
phon <- trimws(cbind(word=xw, structure=xs))
phon <- apply(phon, 1, strsplit, " +")
phon <- lapply(phon, function(x) do.call(cbind, x))
head(phon, 3)
# [[1]]
# word structure
# [1,] "p" "C"
# [2,] "o" "v"
# [3,] "kk" "C:"
# [4,] "o" "v"
# [5,] "i" "v"
# [6,] "tt" "C:"
# [7,] "a" "v"
#
# [[2]]
# word structure
# [1,] "d" "C"
# [2,] "e" "v"
# [3,] "m" "C"
# [4,] "n" "C"
# [5,] "a" "v"
#
# [[3]]
# word structure
# [1,] "n" "C"
# [2,] "i" "v"
# [3,] "ng" "C"
# [4,] "a" "v"
# [5,] "t" "C"
# [6,] "i" "v"
# [7,] "ngg" "NC"
# [8,] "o" "v"
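The question asks only for the C:, NC and CC pieces, so a filtering step is still needed after the segmentation above. The following is a minimal sketch of such a step, assuming each list element is a two-column character matrix with columns word and structure as printed above; extract_clusters is a hypothetical name, and treating two adjacent plain C rows as one CC cluster is an assumption about what the question means by CC.

```r
# Hypothetical helper: keep C: and NC rows, and merge adjacent C + C rows into CC.
extract_clusters <- function(m) {
  seg <- m[, "word"]
  st  <- m[, "structure"]
  n   <- length(st)
  hit <- which(st %in% c("C:", "NC"))            # rows that are already clusters
  cc  <- which(st[-n] == "C" & st[-1] == "C")    # start positions of C followed by C
  res <- data.frame(
    pos       = c(hit, cc),
    word      = c(seg[hit], paste0(seg[cc], seg[cc + 1])),
    structure = c(st[hit], rep("CC", length(cc))),
    stringsAsFactors = FALSE
  )
  res <- res[order(res$pos), c("word", "structure")]  # restore word order
  rownames(res) <- NULL
  res
}

# Small hand-built inputs in the same shape as the elements of phon
demna <- cbind(word = c("d", "e", "m", "n", "a"),
               structure = c("C", "v", "C", "C", "v"))
pokkoitta <- cbind(word = c("p", "o", "kk", "o", "i", "tt", "a"),
                   structure = c("C", "v", "C:", "v", "v", "C:", "v"))
extract_clusters(demna)
extract_clusters(pokkoitta)
```

Applied to the full list, something like do.call(rbind, lapply(phon, extract_clusters)) would stack the results for all words.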


Reorder vector so no certain items are positioned next to each other

Please consider the following example:
[[1]]
[1] 11 12 13 14
[[2]]
[1] 1 2 3
[[3]]
[1] 4
[[4]]
[1] 5
[[5]]
[1] 6
[[6]]
[1] 7
[[7]]
[1] 8
[[8]]
[1] 9
[[9]]
[1] 10
[[10]]
[1] 15
[[11]]
[1] 16
[[12]]
[1] 17
In this example, I have 12 unique values in a vector that is 17 elements long. For simplicity, let's say that this vector is:
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
The first code block shows the index positions in foo_bar of each of the unique values (the letters a–l).
I am attempting to write an algorithm that reorders foo_bar so that, for every index i except the final one (index 17 in the foo_bar example), positions i and i+1 never contain the same value. Here's an example of an appropriate outcome:
reordered_foo_bar <- c("b","c","b","d","b","e","f","g","h","a","i","a","j","a","k","a", "l")
something like this?
foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a", "j", "k", "l")
test <- FALSE
while (test == FALSE) {
new_foo_bar <- sample(foo_bar, size = length(foo_bar), replace = FALSE)
test <- length(rle(new_foo_bar)$lengths) == length(foo_bar)
}
new_foo_bar
# [1] "f" "a" "g" "b" "h" "d" "j" "c" "e" "i" "a" "b" "k" "a" "l" "a" "b"
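The loop above never terminates when no valid ordering exists (for example c("a","a","b","a")). A minimal sketch of the same rejection-sampling idea with a cap on the number of attempts is shown below; the function name shuffle_no_adjacent and the max_tries argument are made up for illustration.

```r
# Hypothetical helper: reshuffle until no two neighbours are equal,
# giving up (and returning NULL) after max_tries attempts.
shuffle_no_adjacent <- function(x, max_tries = 10000L) {
  if (length(x) < 2L) return(x)
  for (k in seq_len(max_tries)) {
    cand <- sample(x)
    # compare every element with its right-hand neighbour
    if (!any(cand[-1] == cand[-length(cand)])) return(cand)
  }
  NULL
}

foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a","j","k","l")
res <- shuffle_no_adjacent(foo_bar)
res

shuffle_no_adjacent(c("a", "a", "b", "a"), max_tries = 100L)
# NULL
```

The result is random, so it will differ from run to run unless set.seed() is called first.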
First we identify the indices of the unique values in the vector.
library(magrittr) # provides %>%
indices <-
unique(foo_bar) %>%
sort() %>%
lapply(function(x) which(foo_bar == x))
Then we create a position score based on 1) the rank of the value when the values are ordered by decreasing frequency and 2) how many occurrences of this value have already been placed, and we add these two numbers together. However, to ensure that a different value is inserted between repeats, we divide 1) by 2. Finally, we order the position scores and reorder foo_bar with this new order. Note that the code sorts the unique values with sort(), which coincides with decreasing frequency in this example.
This solution is also robust in cases where it is not possible to keep duplicate values apart (for example, when the values are c("a","a","b","a")).
out <-
lengths(indices) %>%
lapply(., function(x) 1:x) %>%
{lapply(seq_along(.), function(x) (unlist(.[x]) + x / 2))} %>%
unlist() %>%
order() %>%
{unlist(indices)[.]} %>%
foo_bar[.]
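The scoring idea can also be written in base R without pipes. The sketch below is one possible reading of it, not the answer's exact code: spread_duplicates is a hypothetical name, and the groups are explicitly ranked by decreasing frequency, as the description says, rather than by sort().

```r
# Hypothetical helper: score = occurrence number + (group rank / 2), then order by score.
spread_duplicates <- function(x) {
  groups <- split(seq_along(x), x)              # index positions per value
  groups <- groups[order(-lengths(groups))]     # most frequent value first
  score <- unlist(lapply(seq_along(groups), function(g) {
    seq_along(groups[[g]]) + g / 2              # k-th occurrence of the group ranked g
  }))
  idx <- unlist(groups, use.names = FALSE)
  x[idx[order(score)]]                          # ties keep their original order
}

foo_bar <- c("b","b","b","c","d","e","f","g","h","i","a","a","a","a","j","k","l")
spread_duplicates(foo_bar)

spread_duplicates(c("c", "c", "c", "a", "b"))
# [1] "c" "a" "c" "b" "c"
```

In the second call the most frequent value is not first alphabetically, which is where ranking by frequency rather than by name makes a difference.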
The output is then:
> "a" "b" "a" "c" "b" "d" "a" "e" "b" "f" "a" "g" "h" "i" "j" "k" "l"

R: preserving 1-row / -column matrix [duplicate]

Given a matrix with one row, one column, or one cell, I need to reorder the rows while keeping the matrix structure. I tried adding drop=F but it doesn't work! What did I do wrong?
test = matrix(letters[1:5]) # is a matrix
test[5:1,,drop=F] # not a matrix
test2 = matrix(letters[1:5],nrow=1) # is a matrix
test2[1:1,,drop=F] # not a matrix
test3 = matrix(1) # is a matrix
test3[1:1,,drop=F] # not a matrix
I'd guess that F was overwritten. F is an ordinary variable that can be reassigned, in which case it no longer means FALSE. Always write out FALSE in full; it is a reserved word and cannot be assigned to.
See Is there anything wrong with using T & F instead of TRUE & FALSE?
Also the R Inferno, section 8.1.32, is a good reference.
> F <- 1
> test = matrix(letters[1:5]) # is a matrix
> test[5:1,,drop=F] # not a matrix
[1] "e" "d" "c" "b" "a"
> test[5:1,,drop=FALSE] # but this is a matrix
[,1]
[1,] "e"
[2,] "d"
[3,] "c"
[4,] "b"
[5,] "a"
> rm(F)
> test[5:1,,drop=F] # now a matrix again
[,1]
[1,] "e"
[2,] "d"
[3,] "c"
[4,] "b"
[5,] "a"
The code in your question works fine in a fresh R session:
test = matrix(letters[1:5]) # is a matrix
result = test[5:1,,drop=F]
result
# [,1]
# [1,] "e"
# [2,] "d"
# [3,] "c"
# [4,] "b"
# [5,] "a"
class(result) # still a matrix
# [1] "matrix"
dim(result)
# [1] 5 1
Even on the 1x1 matrix:
test3 = matrix(1) # is a matrix
result3 = test3[1:1,,drop=F]
class(result3)
# [1] "matrix"
dim(result3)
# [1] 1 1
Maybe you've loaded other packages that are overriding the default behavior? What makes you think you don't end up with a matrix?
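Since class() prints differently across R versions (from R 4.0.0 a matrix reports both "matrix" and "array"), is.matrix() and dim() give a more stable check. A small sketch covering the three cases from the question, written with FALSE spelled out:

```r
test  <- matrix(letters[1:5])            # 5 x 1
test2 <- matrix(letters[1:5], nrow = 1)  # 1 x 5
test3 <- matrix(1)                       # 1 x 1

r1 <- test[5:1, , drop = FALSE]
r2 <- test2[1, , drop = FALSE]
r3 <- test3[1, , drop = FALSE]

is.matrix(r1); dim(r1)   # TRUE; 5 1
is.matrix(r2); dim(r2)   # TRUE; 1 5
is.matrix(r3); dim(r3)   # TRUE; 1 1

# Quick diagnostic: FALSE here means F has been masked in the current session
identical(F, FALSE)
```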
The following works:
test <- matrix(test[5:1,, drop = F], nrow = 5, ncol = 1)
If you test the result with is.matrix, it returns TRUE. Specifying the number of rows (nrow) and columns (ncol) explicitly coerces the result to the dimensions you require.

How to generate stratified permutations in R

I would like to generate different possible permutations with the same frequency as in the input vector. For example, I would like to generate the permutations using the vector x in the below example.
library(gtools)
x <- c('A','A','B')
permutations(2, 3, x, repeats.allowed = T)
It gives the below output.
# [,1] [,2] [,3]
# [1,] "A" "A" "A"
# [2,] "A" "A" "B"
# [3,] "A" "B" "A"
# [4,] "A" "B" "B"
# [5,] "B" "A" "A"
# [6,] "B" "A" "B"
# [7,] "B" "B" "A"
# [8,] "B" "B" "B"
But, I want only permutations having A, B with frequencies 2, 1 respectively. The expected output is:
# [,1] [,2] [,3]
# [1,] "A" "A" "B"
# [2,] "A" "B" "A"
# [3,] "B" "A" "A"
Is there any function available in R?
Note: I do not want to post-process the output to get the expected result, as my original input contains 300 elements. Generating factorial(300) permutations is not feasible.
Update: The suggested link provides a nice faster solution but fails when the input vector is doubled (eg: length=20) with the error message:
Error in matrix(NA, nrow = N, ncol = prod(sapply(foo, ncol))) :
invalid 'ncol' value (too large or NA)
Your problem can be reformulated as finding all possible permutations of the frequency vector. Take a look at combinat::permn:
x <- c( 'A', 'A', 'B' )
unique(combinat::permn( x ))
# [[1]]
# [1] "A" "A" "B"
# [[2]]
# [1] "A" "B" "A"
# [[3]]
# [1] "B" "A" "A"
unique is necessary to remove duplicate entries, which is automatically done by gtools::permutations you've been using (through the default set=TRUE argument).
If you need the result in matrix format, as in your original question, pass the output as arguments to rbind using do.call:
do.call( rbind, unique(combinat::permn( x )) )
# [,1] [,2] [,3]
# [1,] "A" "A" "B"
# [2,] "A" "B" "A"
# [3,] "B" "A" "A"
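permn still builds all factorial(n) orderings before unique removes the repeats. An alternative is to generate only the distinct arrangements by recursing over the remaining count of each value. The following is a rough base R sketch of that idea, not part of combinat or gtools; multiset_perms is a hypothetical name. The number of rows is the multinomial coefficient n! / prod(counts!), so it can still be far too large to enumerate for long inputs.

```r
# Hypothetical helper: all distinct orderings of x, one per row.
multiset_perms <- function(x) {
  tab    <- table(x)
  vals   <- names(tab)          # distinct values (as character)
  counts <- as.integer(tab)     # how many of each value are still unused
  n      <- length(x)
  out    <- list()
  cur    <- character(n)
  rec <- function(pos) {
    if (pos > n) {              # a full arrangement has been built
      out[[length(out) + 1L]] <<- cur
      return(invisible(NULL))
    }
    for (j in seq_along(vals)) {
      if (counts[j] > 0L) {
        counts[j] <<- counts[j] - 1L   # use one copy of vals[j]
        cur[pos]  <<- vals[j]
        rec(pos + 1L)
        counts[j] <<- counts[j] + 1L   # put it back
      }
    }
  }
  rec(1L)
  do.call(rbind, out)
}

multiset_perms(c("A", "A", "B"))
#      [,1] [,2] [,3]
# [1,] "A"  "A"  "B"
# [2,] "A"  "B"  "A"
# [3,] "B"  "A"  "A"
```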

Split one column into two

I have a large data.table of genotypes (260,000 rows by 1000 columns). The rows are markers and the columns are the subjects. The data looks like this:
ID1 ID2 ID3 ID4
M1: CC CC TC CC
M2: GG GG GG GG
M3: TT TT TT TT
M4: TG TG TG TG
M5: TT TT TT TT
M6: TT TT TT TT
I need to split each genotype so that I have each allele in its own column like this:
V1 V2 V3 V4 V5 V6 V7 V8
M1: C C C C T C C C
M2: G G G G G G G G
M3: T T T T T T T T
M4: T G T G T G T G
M5: T T T T T T T T
M6: T T T T T T T T
I have come up with two solutions, both of which work on a subset of the data but break down on the entire data set, due either to memory issues or to an internal data.table error that I don't understand.
I used strsplit on each column and stored it to a list, then used do.call to merge them all. I also parallelized it using the foreach function
ids <- colnames(DT)
gene.split <- function(i) {
as.data.table(do.call(rbind,strsplit(as.vector(eval(parse(text=paste("DT$",ids[i])))), split = "")))
}
all.gene <- foreach(i=1:length(ids)) %dopar% gene.split(i)
do.call(cbind,all.gene)
On 4 cores this breaks down due to memory issues.
The second solution is based on a similar problem which uses the set function:
out_names <- paste("V", 1:(2*ncol(DT)), sep="_")
invar1 <- names(DT)
for (i in seq_along(invar1)) {
set(DT, i=NULL, j=out_names[2*i-1], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,1])
set(DT, i=NULL, j=out_names[2*i], value=do.call(rbind, strsplit(DT[[invar1[i]]], split = ""))[,2])
}
which works on a few columns but then I get the following error if I try using the entire dataset:
Error in set(DT, i = NULL, j = out_names[2 * i - 1], value = do.call(rbind, :
Internal logical error. DT passed to assign has not been allocated enough column slots. l=163, tl=163, adding 1
Am I going about this the wrong way?
Here is an approach using data.table::set and substr (not strsplit)
Using @jbaums's example data l:
# coerce to `data.table` without a copy
setDT(l)
# over allocate columns so that `data.table` can assign by reference
# this will stop the error you were seeing
alloc.col(l,3000)
out_names <- paste("V", 1:(2*ncol(l)), sep="_")
invar1 <- names(l)
for (i in seq_along(invar1)) {
set(l, i=NULL, j=out_names[2*i-1], value=substr(l[[invar1[i]]],1,1))
set(l, i=NULL, j=out_names[2*i], value=substr(l[[invar1[i]]],2,2))
}
The final step took 37 seconds on my Windows 7 i7 2600 machine with 8GB ram
In your example you run strsplit twice per column (and call do.call(rbind, ...) each time), which is inefficient.
Some benchmarking of possible approaches to the splitting....
microbenchmark(substr(l[[invar1[1L]]],2,2), sapply(strsplit(l[[invar1[1L]]],''),`[`,2L),do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[,2], times=5)
Unit: milliseconds
expr min lq median uq max neval
substr(l[[invar1[1L]]], 2, 2) 14.10669 14.35571 14.57485 15.78283 193.9125 5
sapply(strsplit(l[[invar1[1L]]], ""), `[`, 2L) 345.92969 1420.03907 1944.33873 3864.82876 5371.6130 5
do.call(rbind, strsplit(l[[invar1[i]]], split = ""))[, 2] 3318.70878 4131.38551 4155.06126 5269.92745 8414.4948 5
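For readers without data.table at hand, the same substr idea can be sketched in base R on the toy data from the question. This is only an illustration with a preallocated character matrix; split_alleles is a hypothetical name, and it has not been timed on data of the size described in the question.

```r
# Toy data from the question (markers in rows, subjects in columns)
geno <- data.frame(
  ID1 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  ID2 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  ID3 = c("TC", "GG", "TT", "TG", "TT", "TT"),
  ID4 = c("CC", "GG", "TT", "TG", "TT", "TT"),
  row.names = paste0("M", 1:6),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one output column per allele, filled with vectorised substr()
split_alleles <- function(d) {
  out <- matrix(NA_character_, nrow = nrow(d), ncol = 2L * ncol(d))
  for (j in seq_along(d)) {
    col <- as.character(d[[j]])
    out[, 2L * j - 1L] <- substr(col, 1L, 1L)  # first allele
    out[, 2L * j]      <- substr(col, 2L, 2L)  # second allele
  }
  dimnames(out) <- list(rownames(d), paste0("V", seq_len(ncol(out))))
  out
}

split_alleles(geno)
```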
Here's a relatively fast approach - took ~80 sec (after dummy data creation) (Win 8.1 x64; i4770) but chewed up ~13 GB of RAM.
# Creating initial data
pairs <- c(outer(c('C', 'T', 'G', 'A'), c('C', 'T', 'G', 'A'), 'paste0'))
l <- replicate(1000, sample(pairs, 260000, replace=TRUE), simplify=FALSE)
system.time({
v <- do.call(paste0, l)
rm(l); gc()
out <- do.call(rbind, strsplit(v, ''))
rm(v); gc()
})
# user system elapsed
# 79.07 1.24 80.33
str(out)
# chr [1:260000, 1:2000] "A" "C" "C" "C" ...
Here's a way to do this for a data frame x:
do.call(cbind,
lapply(x,
function(i) do.call(rbind, strsplit(as.character(i), split=''))
)
)
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] "C" "C" "C" "C" "T" "C" "C" "C"
[2,] "G" "G" "G" "G" "G" "G" "G" "G"
[3,] "T" "T" "T" "T" "T" "T" "T" "T"
[4,] "T" "G" "T" "G" "T" "G" "T" "G"
[5,] "T" "T" "T" "T" "T" "T" "T" "T"
[6,] "T" "T" "T" "T" "T" "T" "T" "T"
Each column is split into characters and the pieces are row-bound into a two-column matrix. This gives a list of matrices, which is then passed to cbind.
## make a small data.table for testing
dd <- data.table(ID1=c("CC","TG"),ID2=c("CC","TG"), ID3=c("TC","TG"))
dd
## ID1 ID2 ID3
## 1: CC CC TC
## 2: TG TG TG
## the first base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][1])
## ID1 ID2 ID3
## [1,] "C" "C" "T"
## [2,] "T" "T" "T"
## the second base
apply(dd,1:2,function(e) strsplit(e,split='')[[1]][2])
## ID1 ID2 ID3
## [1,] "C" "C" "C"
## [2,] "G" "G" "G"
## These results are in matrix, if you need data.table use as.data.table to convert them back.

Shuffling a vector - all possible outcomes of sample()?

I have a vector with five items.
my_vec <- c("a","b","a","c","d")
If I want to re-arrange those values into a new vector (shuffle), I could use sample():
shuffled_vec <- sample(my_vec)
Easy - but the sample() function only gives me one possible shuffle. What if I want to know all possible shuffling combinations? The various "combn" functions don't seem to help, and expand.grid() gives me every possible combination with replacement, when I need it without replacement. What's the most efficient way to do this?
Note that in my vector, I have the value "a" twice - therefore, in the set of shuffled vectors returned, they all should each have "a" twice in the set.
I think permn from the combinat package does what you want
library(combinat)
permn(my_vec)
A smaller example
> x
[1] "a" "a" "b"
> permn(x)
[[1]]
[1] "a" "a" "b"
[[2]]
[1] "a" "b" "a"
[[3]]
[1] "b" "a" "a"
[[4]]
[1] "b" "a" "a"
[[5]]
[1] "a" "b" "a"
[[6]]
[1] "a" "a" "b"
If the duplicates are a problem you could do something similar to this to get rid of duplicates
strsplit(unique(sapply(permn(my_vec), paste, collapse = ",")), ",")
Or probably a better approach to removing duplicates...
dat <- do.call(rbind, permn(my_vec))
dat[!duplicated(dat),]
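Note that duplicated() flags the repeated rows, so the mask has to be negated to keep one copy of each permutation; unique() on a matrix does the same thing. A small sketch with a hand-built matrix standing in for the permn output of c("a", "a", "b"):

```r
# Rows as they would come out of permn() for c("a", "a", "b")
dat <- rbind(
  c("a", "a", "b"),
  c("a", "b", "a"),
  c("b", "a", "a"),
  c("b", "a", "a"),
  c("a", "b", "a"),
  c("a", "a", "b")
)

kept <- dat[!duplicated(dat), , drop = FALSE]  # first occurrence of each row
kept
#      [,1] [,2] [,3]
# [1,] "a"  "a"  "b"
# [2,] "a"  "b"  "a"
# [3,] "b"  "a"  "a"

unique(dat)                                    # same rows
```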
Noting that your data is effectively 5 levels from 1-5, encoded as "a", "b", "a", "c", and "d", I went looking for ways to get the permutations of the numbers 1-5 and then remap those to the levels you use.
Let's start with the input data:
my_vec <- c("a","b","a","c","d") # the character
my_vec_ind <- seq(1,length(my_vec),1) # their identifier
To get the permutations, I applied the function given at Generating all distinct permutations of a list in R:
permutations <- function(n){
if(n==1){
return(matrix(1))
} else {
sp <- permutations(n-1)
p <- nrow(sp)
A <- matrix(nrow=n*p,ncol=n)
for(i in 1:n){
A[(i-1)*p+1:p,] <- cbind(i,sp+(sp>=i))
}
return(A)
}
}
First, create a data.frame with the permutations:
tmp <- data.frame(permutations(length(my_vec)))
You now have a data frame tmp of 120 rows, where each row is a unique permutation of the numbers, 1-5:
>tmp
X1 X2 X3 X4 X5
1 1 2 3 4 5
2 1 2 3 5 4
3 1 2 4 3 5
...
119 5 4 3 1 2
120 5 4 3 2 1
Now you need to remap them to the strings you had. You can remap them using a variation on the theme of gsub(), proposed here: R: replace characters using gsub, how to create a function?
gsub2 <- function(pattern, replacement, x, ...) {
for(i in 1:length(pattern))
x <- gsub(pattern[i], replacement[i], x, ...)
x
}
gsub() won't work because you have more than one value in the replacement array.
You also need a function you can call using lapply() to use the gsub2() function on every element of your tmp data.frame.
remap <- function(x,
old,
new){
return(gsub2(pattern = old,
replacement = new,
fixed = TRUE,
x = as.character(x)))
}
Almost there. We do the mapping like this:
shuffled_vec <- as.data.frame(lapply(tmp,
remap,
old = as.character(my_vec_ind),
new = my_vec))
which can be simplified to...
shuffled_vec <- as.data.frame(lapply(data.frame(permutations(length(my_vec))),
remap,
old = as.character(my_vec_ind),
new = my_vec))
.. should you feel the need.
That gives you your required answer:
> shuffled_vec
X1 X2 X3 X4 X5
1 a b a c d
2 a b a d c
3 a b c a d
...
119 d c a a b
120 d c a b a
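As an aside, the remapping step can probably be done without gsub at all, by using the permutation matrix directly as an index into my_vec. A minimal sketch on a three-element example, with the index matrix written out by hand instead of calling the permutations function above:

```r
my_vec <- c("a", "b", "a")

# All 6 permutations of the positions 1..3, one per row
idx <- rbind(c(1, 2, 3), c(1, 3, 2), c(2, 1, 3),
             c(2, 3, 1), c(3, 1, 2), c(3, 2, 1))

# Indexing a plain vector with a matrix treats the matrix as a vector,
# so the shape has to be restored afterwards
shuffled <- matrix(my_vec[idx], nrow = nrow(idx))
shuffled[2, ]
# [1] "a" "a" "b"

unique(shuffled)   # drop the repeats caused by the two "a"s
```

This relies only on base R indexing, so it avoids converting the numbers to character and back.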
Looking at a previous question (R: generate all permutations of vector without duplicated elements), I can see that the gtools package has a function for this. I couldn't however get this to work directly on your vector as such:
permutations(n = 5, r = 5, v = my_vec)
#Error in permutations(n = 5, r = 5, v = my_vec) :
# too few different elements
You can adapt it however like so:
apply(permutations(n = 5, r = 5), 1, function(x) my_vec[x])
# [,1] [,2] [,3] [,4]
#[1,] "a" "a" "a" "a" ...
#[2,] "b" "b" "b" "b" ...
#[3,] "a" "a" "c" "c" ...
#[4,] "c" "d" "a" "d" ...
#[5,] "d" "c" "d" "a" ...
