Replacing matrix cells for corresponding row names - r

I'm working with a matrix that looks like this input.
I'm trying to replace the numbers in column 2 with their corresponding row name, i.e. all 1s would be replaced by the name of row 1 from rownames(matrix). Thus, I'd have the following output.
The actual matrix is too large to loop over. I'm sorry for using images; I found it easier to represent this in Excel. I'm also quite new to R.

Vectorized approach (should be the fastest you can get):
mat <- matrix(c(letters[1:11], 1,1,1,2,2,3,3,3,4,4,4), ncol = 2)
colnames(mat) <- c("A", "B")
rownames(mat) <- 1:11
> mat
A B
1 "a" "1"
2 "b" "1"
3 "c" "1"
4 "d" "2"
5 "e" "2"
6 "f" "3"
7 "g" "3"
8 "h" "3"
9 "i" "4"
10 "j" "4"
11 "k" "4"
mat[, "B"] <- mat[as.numeric(mat[, "B"]), "A"]
> mat
A B
1 "a" "a"
2 "b" "a"
3 "c" "a"
4 "d" "b"
5 "e" "b"
6 "f" "c"
7 "g" "c"
8 "h" "c"
9 "i" "d"
10 "j" "d"
11 "k" "d"
Or you could use sapply:
mat[, "B"] <- sapply(mat[, "B"], function(x) mat[as.numeric(x), "A"])
Edit: I've put the vectorized solution at the top, as this is clearly the faster (or even fastest?) approach.
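As a small check of the two approaches above, the sketch below rebuilds the example matrix, spells out the intermediate index vector (the name idx is only for illustration), and compares the results. It assumes the values in column B are valid row positions.

```r
# Rebuild the example matrix from the answer.
mat <- matrix(c(letters[1:11], 1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4), ncol = 2)
colnames(mat) <- c("A", "B")
rownames(mat) <- 1:11

# Column B holds digits stored as character, so convert them to row positions.
idx <- as.numeric(mat[, "B"])

# Indexing column A with those positions performs every lookup in one call.
vectorized <- unname(mat[idx, "A"])

# The sapply variant performs the same lookup one element at a time.
looped <- unname(sapply(mat[, "B"], function(x) mat[as.numeric(x), "A"]))

# Both approaches should agree.
identical(vectorized, looped)
```

If the values in B refer to row names rather than row positions, indexing with the character values directly, as in mat[mat[, "B"], "A"], would match on the row names instead.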

Related

Make r ignore the order at which values appear in a column (created from pasting multiple columns)

Given a variable x that can take values A,B,C,D
And three columns for variable x:
df1<-
rbind(c("A","B","C"),c("A","D","C"),c("B","A","C"),c("A","C","B"), c("B","C","A"), c("D","A","B"), c("A","B","D"), c("A","D","C"), c("A",NA,NA),c("D","A",NA),c("A","D",NA))
How do I make a column indicating the combination in the three preceding columns, such that permutations (ABC, ACB, BAC) are considered the same combination ABC, and (AD, DA) are considered the same combination AD?
Pasting the three columns with apply(df1, 1, function(x) paste(x[!is.na(x)], collapse=", ")) -> df1$x4 and then counting with df1 %>% group_by(x4) %>% summarize(c = n()) would count AD and DA as different instead of the same.
My desired result would be to get
a<-cbind(c("ABC",4),c("ACD",2),c("ABD",2),c("A",1),c("AD",2))
Someone already solved my question. Thanks
You can apply paste to each row vector after sorting it. Note that sort drops NA values by default, so they do not end up in the pasted string.
df1 <-
cbind(df1, apply(df1, 1, function(x) paste(sort(x), collapse = "")))
df1
# [,1] [,2] [,3] [,4]
# [1,] "A" "B" "C" "ABC"
# [2,] "A" "D" "C" "ACD"
# [3,] "B" "A" "C" "ABC"
# [4,] "A" "C" "B" "ABC"
# [5,] "B" "C" "A" "ABC"
# [6,] "D" "A" "B" "ABD"
# [7,] "A" "B" "D" "ABD"
# [8,] "A" "D" "C" "ACD"
# [9,] "A" NA NA "A"
#[10,] "D" "A" NA "AD"
#[11,] "A" "D" NA "AD"
You can now simply table the column, with no need for an external package to be loaded and more complex pipes.
table(df1[, 4])
#A ABC ABD ACD AD
#1 4 2 2 2
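If the counts are wanted as a data frame rather than a table, something along these lines might work. This is only a sketch: the names combo, counts, combination and n are illustrative and not from the original answer.

```r
df1 <- rbind(c("A", "B", "C"), c("A", "D", "C"), c("B", "A", "C"),
             c("A", "C", "B"), c("B", "C", "A"), c("D", "A", "B"),
             c("A", "B", "D"), c("A", "D", "C"), c("A", NA, NA),
             c("D", "A", NA), c("A", "D", NA))

# Sort each row before pasting so that permutations collapse to one key.
# sort() removes NA values by default.
combo <- apply(df1, 1, function(x) paste(sort(x), collapse = ""))

# Tabulate the keys and turn the table into a two-column data frame.
counts <- as.data.frame(table(combo), stringsAsFactors = FALSE)
names(counts) <- c("combination", "n")
counts
```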

How to convert frequency into text by using R?

I have dataframe like this (ID, Frequency A B C D E)
ID A B C D E
1 5 3 2 1 0
2 3 2 2 1 0
3 4 2 1 1 1
I want to convert this dataframe into a text-based document like this (the ID, and the letters A to E repeated according to their frequency as words in a single column). Then I can use an LDA algorithm to identify hot topics for each ID.
ID Text
1 "A" "A" "A" "A" "A" "B" "B" "B" "C" "C" "D"
2 "A" "A" "A" "B" "B" "C" "C" "D"
3 "A" "A" "A" "A" "B" "B" "C" "D" "E"
We can use data.table
library(data.table)
DT <- setDT(df1)[,.(list(rep(names(df1)[-1], unlist(.SD)))) ,ID]
DT$V1
#[[1]]
#[1] "A" "A" "A" "A" "A" "B" "B" "B" "C" "C" "D"
#[[2]]
#[1] "A" "A" "A" "B" "B" "C" "C" "D"
#[[3]]
#[1] "A" "A" "A" "A" "B" "B" "C" "D" "E"
Or a base R option is split
lst <- lapply(split(df1[-1], df1$ID), rep, x=names(df1)[-1])
lst
#$`1`
#[1] "A" "A" "A" "A" "A" "B" "B" "B" "C" "C" "D"
#$`2`
#[1] "A" "A" "A" "B" "B" "C" "C" "D"
#$`3`
#[1] "A" "A" "A" "A" "B" "B" "C" "D" "E"
If we want to write 'lst' to a csv file, one option is to convert the list to a rectangular object first. Because a data.frame is a list of equal-length columns, the shorter elements are padded with NA at the end to make the lengths equal.
res <- do.call(rbind, lapply(lst, `length<-`, max(lengths(lst))))
Or use a convenient function from stringi
library(stringi)
res <- stri_list2matrix(lst, byrow=TRUE)
and then use the write.csv
write.csv(res, "yourdata.csv", quote=FALSE, row.names = FALSE)
You can use apply and rep like so:
apply(df[-1], 1, function(i) rep(names(df)[-1], i))
For each row, apply feeds the rep function the number of times to repeat each variable name. This returns a list of vectors:
[[1]]
[1] "A" "A" "A" "A" "A" "B" "B" "B" "C" "C" "D"
[[2]]
[1] "A" "A" "A" "B" "B" "C" "C" "D"
[[3]]
[1] "A" "A" "A" "A" "B" "B" "C" "D" "E"
Where each list element is a row of your data.frame.
data
df <- read.table(header=T, text="ID A B C D E
1 5 3 2 1 0
2 3 2 2 1 0
3 4 2 1 1 1")
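Since the question asks for the words in a single column, one possible follow-up is to collapse each repeated vector into one string per ID. This is a hedged sketch; the column name Text and the helper name words are assumptions for illustration.

```r
df <- read.table(header = TRUE, text = "ID A B C D E
1 5 3 2 1 0
2 3 2 2 1 0
3 4 2 1 1 1")

# The column names (without ID) are the "words" to repeat.
words <- names(df)[-1]

# For each row, repeat every word by its count and join with spaces.
df$Text <- unname(apply(df[-1], 1, function(counts) {
  paste(rep(words, counts), collapse = " ")
}))

df[c("ID", "Text")]
```

A character column like this is usually easier to pass to text-mining tools than a list of vectors, although the exact input format depends on the LDA implementation used.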

Make list of variables automatically from other vectors in r

I have a matrix of 107 DNA sequences (columns), each 10 bases long (rows). I also have a vector of population frequencies called nSamples, and a vector of names for these populations called dnapops. I would like to automatically create a nested list that contains separately the first 27 sequences as dna1, the next 27 as dna2, the next 17 as dna3, and so on until all 107 sequences are in their respective population in the list.
This needs to be dynamic, as the number of populations and DNA sequences changes from application to application:
$dna1
$dna1$'1'
[1] "g" "t" "g" "a" "t" "t" "c" "c" "g" "g"
$dna1$'2'
[1] "g" "t" "g" "a" "t" "t" "c" "c" "g" "g"
and so on until
$dna1$'27'
[1] "g" "t" "g" "a" "t" "t" "c" "c" "g" "g"
then it goes to dna2 and lists its 27 sequences, then dna3 and lists its 17 sequences.......
dna <- matrix(data=sample(c("a","g","c","t"),1070,replace=T),nrow=10,ncol=107)
nSamples <- c(27,27,17,12,1,10,3,1,6,3)
dnapops <- c("dna1","dna2","dna3","dna4","dna5","dna6","dna7","dna8","dna9","dna10")
We can repeat each group index in seq_along(nSamples) by its count in 'nSamples' and use that to split the column indices of 'dna'. For each group we then extract its columns (keeping the matrix structure with drop=FALSE) and split the resulting matrix by column with col.
lst <- lapply(split(seq_len(ncol(dna)), rep(seq_along(nSamples), nSamples)),
              function(i) {
                x1 <- dna[, i, drop=FALSE]
                split(x1, col(x1))
              })
lengths(lst)
# 1 2 3 4 5 6 7 8 9 10
#27 27 17 12 1 10 3 1 6 3
lst[[1]][1:5]
#$`1`
#[1] "g" "a" "c" "c" "c" "t" "g" "t" "t" "g"
#$`2`
#[1] "c" "g" "c" "c" "g" "t" "a" "a" "c" "a"
#$`3`
#[1] "a" "c" "c" "a" "a" "c" "a" "c" "c" "a"
#$`4`
#[1] "g" "a" "g" "a" "t" "a" "c" "c" "c" "t"
#$`5`
#[1] "g" "g" "g" "a" "a" "a" "g" "g" "g" "g"
data
set.seed(24)
dna <- matrix(data=sample(c("a","g","c","t"),1070,replace=T),nrow=10,ncol=107)
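The list above is named 1 to 10 rather than dna1 to dna10 as in the question. A minimal sketch of one way to get the population names, assuming dnapops is in the same order as nSamples (the name grp is illustrative), is to split on a factor built from dnapops:

```r
set.seed(24)
dna <- matrix(sample(c("a", "g", "c", "t"), 1070, replace = TRUE),
              nrow = 10, ncol = 107)
nSamples <- c(27, 27, 17, 12, 1, 10, 3, 1, 6, 3)
dnapops <- paste0("dna", seq_along(nSamples))

# One group label per column; fixing the factor levels keeps the
# original order, otherwise "dna10" would sort before "dna2".
grp <- factor(rep(dnapops, nSamples), levels = dnapops)

lst <- lapply(split(seq_len(ncol(dna)), grp), function(i) {
  x1 <- dna[, i, drop = FALSE]   # keep a matrix even for one column
  split(x1, col(x1))             # one list element per sequence
})

names(lst)
lengths(lst)
```

Because nothing here hard-codes the number of populations, the same code should adapt when nSamples and dnapops change, as long as sum(nSamples) equals ncol(dna).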

Splitting a vector of strings to a dataframe with columns containing the respective characters

As a variant of this question
I have a vector with strings, each string has 2 to 4 characters.
Strng <- c("XDX", "GUV", "FQ", "ACUE", "HIT", "AYX", "NFD", "AHBW", "GKQ", "PYF")
I want to split it into a data frame with 4 columns, where each column contains one of the characters or 0 (for the case where the length of the string is less than 4). The zeros can also be in front; it doesn't matter.
So (probably) after applying this:
ss<-strsplit(Strng,"")
z<-lapply(ss,as.character)
I would like to have a dataframe like this:
>df
"X" "D" "X" "0"
"G" "U" "V" "0"
"F" "Q" "0" "0"
"A" "C" "U" "E"
"H" "I" "T" "0"
"A" "Y" "X" "0"
"N" "F" "D" "0"
"A" "H" "B" "W"
"G" "K" "Q" "0"
"P" "Y" "F" "0"
Any ideas?
Thank you,
Kalin
Here's an alternative with "data.table":
library(data.table)
setDT(tstrsplit(Strng, "", fill = "0"))[]
# V1 V2 V3 V4
# 1: X D X 0
# 2: G U V 0
# 3: F Q 0 0
# 4: A C U E
# 5: H I T 0
# 6: A Y X 0
# 7: N F D 0
# 8: A H B W
# 9: G K Q 0
# 10: P Y F 0
You could also use cSplit from my "splitstackshape" package, but it fills with NA and uses slightly unusual syntax:
library(splitstackshape)
cSplit(data.table(Strng), "Strng", "", stripWhite = FALSE)
We can use stri_list2matrix from stringi after we split the "Strng" to a list.
library(stringi)
stri_list2matrix(strsplit(Strng, ''), fill=0, byrow=TRUE)
# [,1] [,2] [,3] [,4]
# [1,] "X" "D" "X" "0"
# [2,] "G" "U" "V" "0"
# [3,] "F" "Q" "0" "0"
# [4,] "A" "C" "U" "E"
# [5,] "H" "I" "T" "0"
# [6,] "A" "Y" "X" "0"
# [7,] "N" "F" "D" "0"
# [8,] "A" "H" "B" "W"
# [9,] "G" "K" "Q" "0"
#[10,] "P" "Y" "F" "0"
Or a base R option would be (variant of the one described in the link)
read.fwf(file= textConnection(Strng),
widths = rep(1,max(nchar(Strng))))
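For a dependency-free variant that pads with "0" as the question asks, the following sketch may help. The names chars, width and padded are illustrative only, and it assumes at least one string has more than one character so that vapply returns a matrix.

```r
Strng <- c("XDX", "GUV", "FQ", "ACUE", "HIT", "AYX", "NFD", "AHBW", "GKQ", "PYF")

# Split every string into single characters.
chars <- strsplit(Strng, "")

# The widest string determines the number of columns.
width <- max(lengths(chars))

# Pad each character vector on the right with "0" up to that width.
# vapply returns a width x length(Strng) matrix, so transpose it.
padded <- vapply(chars, function(x) c(x, rep("0", width - length(x))),
                 character(width))
df <- as.data.frame(t(padded), stringsAsFactors = FALSE)
df
```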

pairwise analysis in R

I have a large data frame in which I have to find, for each pair of individuals, the columns where both rows are equal.
Here is an example of the dataframe:
>data
ID pos1234 pos1345 pos1456 pos1678
1 1 C A C G
2 2 C G A G
3 3 C A G A
4 4 C G C T
I transformed the dataframe into a pairwise matrix with:
apply(data, 2, combn, m=2)
ID pos1234 pos1345 pos1456 pos1678
[1,] "1" "C" "A" "C" "G"
[2,] "2" "C" "G" "A" "G"
[3,] "1" "C" "A" "C" "G"
[4,] "3" "C" "A" "G" "A"
[5,] "1" "C" "A" "C" "G"
[6,] "4" "C" "G" "C" "T"
[7,] "2" "C" "G" "A" "G"
[8,] "3" "C" "A" "G" "A"
[9,] "2" "C" "G" "A" "G"
[10,] "4" "C" "G" "C" "T"
[11,] "3" "C" "A" "G" "A"
[12,] "4" "C" "G" "C" "T"
I am now having trouble identifying the column containing the identical letters between pairs. For example, for pairs 1 and 2 the columns containing the identical letters would be pos1234 and pos1678.
Would it be possible to get a dataframe with just the identical letters for each pair of individuals?
Thanks in advance.
You can pass a function to combn:
res0 <- combn(nrow(data), 2, FUN = function(x)
    names(data[-1])[ lengths(sapply(data[x,-1], unique)) == 1 ],
  simplify=FALSE)
which gives
[[1]]
[1] "pos1234" "pos1678"
[[2]]
[1] "pos1234" "pos1345"
[[3]]
[1] "pos1234" "pos1456"
[[4]]
[1] "pos1234"
[[5]]
[1] "pos1234" "pos1345"
[[6]]
[1] "pos1234"
To figure out which of these [[1]]..[[6]] correspond to which pairs, take combn again:
res <- setNames(res0, combn(data$ID, 2, paste, collapse="."))
which gives
$`1.2`
[1] "pos1234" "pos1678"
$`1.3`
[1] "pos1234" "pos1345"
$`1.4`
[1] "pos1234" "pos1456"
$`2.3`
[1] "pos1234"
$`2.4`
[1] "pos1234" "pos1345"
$`3.4`
[1] "pos1234"
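The question also asks for a data frame. One possible shape, sketched here under the assumption that a logical table with one row per pair and one column per position is acceptable, compares the two rows of every pair in a single vectorized step. The names pairs, geno and same are illustrative.

```r
data <- data.frame(
  ID = 1:4,
  pos1234 = c("C", "C", "C", "C"),
  pos1345 = c("A", "G", "A", "G"),
  pos1456 = c("C", "A", "G", "C"),
  pos1678 = c("G", "G", "A", "T"),
  stringsAsFactors = FALSE
)

# All pairs of row indices: a 2 x 6 matrix for four individuals.
pairs <- combn(nrow(data), 2)

# Compare the first and second member of every pair, column by column.
geno <- as.matrix(data[-1])
same <- geno[pairs[1, ], , drop = FALSE] == geno[pairs[2, ], , drop = FALSE]

# Label each row with the IDs of the pair it describes.
rownames(same) <- paste(data$ID[pairs[1, ]], data$ID[pairs[2, ]], sep = ".")

as.data.frame(same)
```

Each TRUE marks a position where the pair carries the same letter, so the row labelled 1.2 should be TRUE for pos1234 and pos1678 only, in line with the list output above.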
