Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have a large matrix, for example:
[matrix shown as an image in the original post]
and a vector of genes.
My task is to search the matrix row by row and compile pairs of the gene carrying a mutation (in the matrix it is D707H) with each of the remaining genes in the vector, and add them to a new matrix. I tried to do this with loops, but I have no idea how to write it correctly. For this matrix the result should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
At the moment I have something like this:
[screenshot of my attempt, shown as an image in the original post]
"a" is my initial matrix.
Can anyone point me in the right direction? :)

Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
  gene1 = rownames(mtx)[ ind[, "row"] ],
  gene2 = colnames(mtx)[ ind[, "col"] ],
  val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2

I found my mistake; I now have a matrix. Your code works well, but that's not exactly what I want to do.
a, b, c, d etc. are organisms and the row names are genes (A, B, C, D etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value = 4 in column a, I should get:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the numbers of elements do not match and I do not know how to solve this.
ind <- which(!is.na(a) & !is.nan(a), arr.ind = TRUE)
ind1 <- which(macierz == 1, arr.ind = TRUE)
ramka <- data.frame(
  kolumna = rownames(a)[ind[, "row"]],
  gene1 = colnames(a)[ind[, "col"]],
  gene2 = colnames(a)[ind1[, "col"]]
  # val = macierz[ind]
)
Do you know how to do this in R?
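If it helps, here is a minimal sketch of that pairing logic. The 5x1 matrix, the gene names, and the column name below are invented to mirror the example in the question; this is an assumption about the intended output, not code tested on the real data:

```r
# small example matrix: rows are genes, the column is an organism,
# and gene A carries a value in column a
a <- matrix(c(4, NA, NA, NA, NA), nrow = 5,
            dimnames = list(c("A", "B", "C", "D", "E"), "a"))
ind <- which(!is.na(a), arr.ind = TRUE)
# for every non-NA entry, pair its gene (row name) with each remaining gene
ramka <- do.call(rbind, lapply(seq_len(nrow(ind)), function(k) {
  g <- rownames(a)[ind[k, "row"]]
  data.frame(kolumna = colnames(a)[ind[k, "col"]],
             gene1 = g,
             gene2 = setdiff(rownames(a), g))
}))
ramka
#   kolumna gene1 gene2
# 1       a     A     B
# 2       a     A     C
# 3       a     A     D
# 4       a     A     E
```

Because every row of `ind` produces its own small data frame, the element counts always match, which avoids the recycling problem in the attempt above.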

Related

How to select n random values from each rows of a dataframe in R?

I have a dataframe
df <- data.frame(a = c(56, 23, 15, 10),
                 b = c(43, NA, 90.7, 30.5),
                 c = c(12, 7, 10, 2),
                 d = c(1, 2, 3, 4),
                 e = c(NA, 45, 2, NA))
I want to keep two random non-NA values in each row and convert the rest to NA.
Required output (it will differ because of randomness):
df <- data.frame(
  a = c(56, NA, 15, NA),
  b = c(43, NA, NA, NA),
  c = c(NA, 7, NA, 2),
  d = c(NA, NA, 3, 4),
  e = c(NA, 45, NA, NA))
Code Used
I know how to select random non-NA values from a specific row:
set.seed(2)
sample(which(!is.na(df[1,])),2)
But I have no idea how to apply it to the whole dataframe and get the required output.
You may write a function that keeps n random non-NA values in a row.
keep_n_value <- function(x, n) {
  x1 <- which(!is.na(x))
  x[-sample(x1, n)] <- NA
  x
}
Apply the function by row using base R -
set.seed(123)
df[] <- t(apply(df, 1, keep_n_value, 2))
df
# a b c d e
#1 NA NA 12 1 NA
#2 NA NA 7 2 NA
#3 NA 90.7 10 NA NA
#4 NA 30.5 NA 4 NA
Or if you prefer tidyverse -
purrr::pmap_df(df, ~keep_n_value(c(...), 2))
Base R:
You could try a column-wise apply (sapply) and randomly replace two non-NA values with NA in each column, like:
as.data.frame(sapply(df, function(x) replace(x, sample(which(!is.na(x)), 2), NA)))
Example Output:
a b c d e
1 56 NA 12 NA NA
2 23 NA NA 2 NA
3 NA NA 10 3 NA
4 NA 30.5 NA NA NA
One option using dplyr and purrr could be:
df %>%
mutate(pmap_dfr(across(everything()), ~ `[<-`(c(...), !seq_along(c(...)) %in% sample(which(!is.na(c(...))), 2), NA)))
a b c d e
1 56 43.0 NA NA NA
2 23 NA 7 NA NA
3 15 NA NA NA 2
4 NA 30.5 2 NA NA

Conditional subsetting of data frame keeping previous row

My data frame looks like this
Model w0 p0 w1 p1 w2 p.value
1 Null_model 3.950000e-05 0.7366921 0.988374029 0.000000e+00 1.296464
2 alt_test 1.366006e-02 0.4673263 0.139606503 3.049244e-01 1.146653
3 alt_ref 2.000000e-07 0.4673263 0.000846849 3.049244e-01 1.635038 5.550000e-15
8 Null_model 2.790000e-05 0.7240479 0.987016439 0.000000e+00 1.263556
9 alt_test 7.550000e-09 0.7231176 0.991768899 1.060000e-13 1.369259
10 alt_ref 2.770000e-05 0.7231176 0.995373167 1.060000e-13 1.192839 3.073496e-01
... ... ... ... ... ... ...
What I want is to subset my data.frame so that it keeps every row where p.value < 0.05 and also keeps the row immediately preceding each of those rows.
So ideally my output will be something like this
Model w0 w1 w2
2 alt_test 1.4e-02 0.139606503 1.146653
3 alt_ref 2.00e-07 0.000846849 1.635038
I've tried the following but it doesn't work quite right:
subset(v, p.value < 0.05, select = c(Model,w0,w1,w2))
the output doesn't have the alt_test row.
I have also tried
with(v, ifelse(p.value < 0.05, paste(dplyr::lag(c(w0,w1,w2),1)), ""))
and the output in this case looks like
[1] NA NA NA NA "0.013660056" NA NA NA NA ""
[11] NA NA NA NA "" NA NA NA NA ""
[21] NA NA NA NA "" NA NA NA NA ""
[31] NA NA NA NA "" NA NA NA NA ""
[41] NA NA NA NA "" NA NA NA NA ""
[51] NA NA NA NA "1.34e-11" NA NA NA NA "" ...
I also tried
subset(v, p.value < 0.05, select = c(w0, w1,w2, w0-1, w1-1, w2-1))
but this gives the previous column, so I was wondering if something similar can give previous rows instead?
Thank you
If your data.frame always has the alternating alt_test / alt_ref structure, you can manually construct the subset index as below:
library(data.table)
setDT(myDf)
myDf[Reduce(function(x, y) ifelse(!is.na(x), x, ifelse(!is.na(y), y, FALSE)),
            shift(p.value < 0.05, n = 0:1, type = "lead")),
     .(Model, w0, w1, w2)]
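For comparison, here is a base R sketch of the same idea. The small data frame below is reconstructed from the rows shown in the question, with p.value assumed to be NA except on the alt_ref rows:

```r
v <- data.frame(
  Model   = c("Null_model", "alt_test", "alt_ref"),
  w0      = c(3.95e-05, 1.366006e-02, 2e-07),
  w1      = c(0.988374029, 0.139606503, 0.000846849),
  w2      = c(1.296464, 1.146653, 1.635038),
  p.value = c(NA, NA, 5.55e-15)
)
hits <- which(v$p.value < 0.05)            # which() drops the NA comparisons
keep <- sort(unique(c(hits - 1L, hits)))   # add the row before each hit
keep <- keep[keep >= 1L]                   # guard against a hit in row 1
v[keep, c("Model", "w0", "w1", "w2")]
# rows 2 (alt_test) and 3 (alt_ref) are kept
```

This version does not rely on a strictly alternating structure; it keeps whatever row happens to precede each significant one.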

Create a one line data frame with NAs for a list of column names

I have a vector of names and I would like to create a data frame with these names as column names and a single NA value in each column. I will replace some of the NAs during a loop.
n <- c('a','b')
d <- data.frame(a=NA, b=NA)
So basically I have a vector like n and I would like to automatically create a NA-data frame like d. Is there a handy way of doing so?
There are a few different ways to do this. Here are two:
setNames(do.call(data.frame, rep(list(NA), length(n))), n)
# a b
# 1 NA NA
which is easily generalized for any vector of names
N <- letters[1:6]
setNames(do.call(data.frame, rep(list(NA), length(N))), N)
# a b c d e f
# 1 NA NA NA NA NA NA
A second method uses as.data.frame()
as.data.frame(setNames(rep(list(NA), length(N)), N))
# a b c d e f
# 1 NA NA NA NA NA NA
Or, since you're just using NA values, NA[seq_along(N)] can replace rep():
setNames(data.frame(as.list(NA[seq_along(N)])), N)
# a b c d e f
# 1 NA NA NA NA NA NA
Note that all these will produce logical classed columns. For other classes, you can use NA_integer_, NA_character_, etc.
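As an illustration of those typed constants, a mixed-type one-row frame can be built the same way (the column names here are invented for the example):

```r
n <- c("id", "score", "label")
# one typed NA per column, so each column gets the intended class
d <- setNames(data.frame(NA_integer_, NA_real_, NA_character_,
                         stringsAsFactors = FALSE), n)
sapply(d, class)
# id: "integer", score: "numeric", label: "character"
```

Filling a cell later with `d$score <- 0.5` then keeps the column numeric instead of silently staying logical.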

how to make a text parsing function efficient in R

I have a function which calculates the consonance score of a book. First I import the phonetics dictionary from CMU (which forms a data frame of about 134,000 rows and 33 column variables; any row in the CMU dictionary is basically of the form CLOUDS K L AW1 D Z. The first column has the words, and the remaining columns have their phonetic equivalents). After getting the CMU dictionary, I parse a book into a vector containing all its words; the maximum length of any one book so far is 218,711 words. Each word's phonetics are compared with the phonetics of the next word and of the word after that. The TRUE match values are then combined into a sum. The function I have is this:
getConsonanceScore <- function(book, consonanceScore, CMUdict) {
  for (i in 1:(length(book) - 2)) {
    index1 <- replaceIfEmpty(which(toupper(book[i]) == CMUdict[, 1]))
    index2 <- replaceIfEmpty(which(toupper(book[i + 1]) == CMUdict[, 1]))
    index3 <- replaceIfEmpty(which(toupper(book[i + 2]) == CMUdict[, 1]))
    word1 <- as.character(CMUdict[index1, which(CMUdict[index1, ] != "")])
    word2 <- as.character(CMUdict[index2, which(CMUdict[index2, ] != "")])
    word3 <- as.character(CMUdict[index3, which(CMUdict[index3, ] != "")])
    consonanceScore <- sum(word1 %in% word2)
    consonanceScore <- consonanceScore + sum(word1 %in% word3)
    consonanceScore <- consonanceScore / length(book)
  }
  return(consonanceScore)
}
The replaceIfEmpty function just returns the index of a dummy value (declared in the last row of the data frame) if no match is found in the CMU dictionary for a word in the book. It goes like this:
replaceIfEmpty <- function(x) {
  if (length(x) > 0) {
    return(x)
  } else {
    x <- 133780
    return(x)
  }
}
The issue I am facing is that getConsonanceScore takes a lot of time. So much so that, just to check the function was working at all, I had to divide the book length by 1000 in the loop. I am new to R and would really be grateful for some help making this function more efficient and less time-consuming. Are there any ways of doing this? (I will later have to call this function on possibly 50-100 books.) Thanks a lot!
I recently re-read your question, the comments, and #wibeasley's answer, and realized I hadn't understood everything correctly. Now it has become clearer, and I'll try to suggest something useful.
First of all, we need a small example to work with. I've made it from the dictionary in your link.
dictdf <- read.table(text =
"A AH0
CALLED K AO1 L D
DOG D AO1 G
DOGMA D AA1 G M AH0
HAVE HH AE1 V
I AY1",
header = FALSE, col.names = paste0("V", 1:25), fill = TRUE, stringsAsFactors = FALSE)
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 A AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 CALLED K AO1 L D NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 DOG D AO1 G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 DOGMA D AA1 G M AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 HAVE HH AE1 V NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 I AY1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
bookdf <- data.frame(words = c("I", "have", "a", "dog", "called", "Dogma"))
# words
# 1 I
# 2 have
# 3 a
# 4 dog
# 5 called
# 6 Dogma
Here we read the data from the dictionary with fill = TRUE and manually define the number of columns in the data.frame by setting col.names. You may make 50, 100 or some other number of columns (but I don't think the dictionary contains words that long). And we make bookdf, a vector of words in the form of a data.frame.
Then let's merge the book and the dictionary together. I use the dplyr library mentioned by #wibeasley.
# for big data frames dplyr does merging fast
require("dplyr")
# make all letters uppercase
bookdf[,1] <- toupper(bookdf[,1])
# merge
bookphon <- left_join(bookdf, dictdf, by = c("words" = "V1"))
# words V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 I AY1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 HAVE HH AE1 V NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 A AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 DOG D AO1 G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 CALLED K AO1 L D NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 DOGMA D AA1 G M AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
After that we scan row-wise for matching sounds in consecutive words. I arranged it with the help of sapply.
consonanceScore <- sapply(1:(nrow(bookphon) - 2), function(i_row) {
  word1 <- bookphon[i_row, ][, -1]
  word2 <- bookphon[i_row + 1, ][, -1]
  word3 <- bookphon[i_row + 2, ][, -1]
  word1 <- unlist(word1[which(!is.na(word1) & word1 != "")])
  word2 <- unlist(word2[which(!is.na(word2) & word2 != "")])
  word3 <- unlist(word3[which(!is.na(word3) & word3 != "")])
  sum(word1 %in% word2) + sum(word1 %in% word3)
})
consonanceScore
# [1] 0 0 0 4
There are no shared phonemes in the first three windows, but the 4th word, 'dog', has 2 sounds matching 'called' (D and AO1) and 2 matching 'dogma' (D and G). The result is a numeric vector; you can sum() it, divide by nrow(bookdf), or whatever you need.
Are you sure it's working correctly? Isn't that function returning consonanceScore just for the last three words of the book? If the loop's third-to-last line is consonanceScore <- sum(word1 %in% word2), how is its value being recorded, and how does it influence later iterations of the loop?
There are several vectorization approaches that will increase your speed, but for something tricky like this, I like making sure the slow loopy way is working correctly first. While you're in that stage of development, here are some suggestions how to make the code quicker and/or neater (which hopefully helps you debug with more clarity).
Short-term suggestions
Inside replaceIfEmpty(), use ifelse(). Maybe even use ifelse() directly inside the main function.
Why is as.character() necessary? That casting can be expensive. Are those columns factors? If so, pass stringsAsFactors = FALSE when you use something like read.csv().
Don't use toupper() three times for each iteration. Just convert the whole thing once before the loop starts.
Similarly, don't execute / length(book) for each iteration. Since it's the same denominator for the whole book, divide the final vector of numerators only once (after the loop's done).
Long-term suggestions
Eventually I think you'll want to look up each word only once, instead of three times. Those lookups are expensive. Similar to #inscaven's suggestion, I think an intermediate table makes sense (where each row is a word of the book).
To produce the intermediate table, you should get much better performance from a join function written and optimized by someone else in C/C++. Consider something like dplyr::left_join(). Maybe book has to be converted to a single-variable data.frame first. Then left join it to the first column of the dictionary. The row's subsequent columns will essentially be appended to the right side of book (which I think is what's happening now).
Once each iteration is quicker and correct, consider using one of the xapply functions, or something in dplyr. The advantage of these functions is that memory for the entire vector isn't destroyed and reallocated for every single word in each book.
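To make the short-term suggestions concrete, here is one possible refactor of the original loop with those fixes applied: upper-casing once, looking up each word once via match(), accumulating the score instead of overwriting it, and dividing once at the end. It is a sketch, not tested against the real CMU file; it assumes CMUdict is a character data frame whose last row is the dummy entry, as in the question, and drops the unused accumulator argument:

```r
getConsonanceScore2 <- function(book, CMUdict) {
  book <- toupper(book)                 # upper-case once, before the loop
  idx <- match(book, CMUdict[, 1])      # one dictionary lookup per word
  idx[is.na(idx)] <- nrow(CMUdict)      # unmatched words -> dummy last row
  score <- 0
  for (i in 1:(length(book) - 2)) {
    w1 <- as.character(CMUdict[idx[i], -1])
    w2 <- as.character(CMUdict[idx[i + 1], -1])
    w3 <- as.character(CMUdict[idx[i + 2], -1])
    w1 <- w1[!is.na(w1) & w1 != ""]
    w2 <- w2[!is.na(w2) & w2 != ""]
    w3 <- w3[!is.na(w3) & w3 != ""]
    # accumulate rather than overwrite, so every window counts
    score <- score + sum(w1 %in% w2) + sum(w1 %in% w3)
  }
  score / length(book)                  # divide once, after the loop
}
```

On the six-word example from the answer above this gives 4 matches over 6 words; the remaining loop could then be replaced with sapply() or a join-based approach as discussed.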

R Loop Script to Create Many, Many Variables

I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24), and each sheet contributes somewhere between 100 and 200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the name of the variable created.
I just learnt this neat little trick from #eddi
# create a random data.table with 3 columns
library(data.table)
a <- data.table(
  a1 = c(1, 5),
  a2 = c(2, 1),
  a3 = c(3, 4)
)
# assuming you now need to add more columns, a4 through a200:
# first, create the sequence from 4 to 200
v <- 4:200
# then use that sequence to add the 197 extra columns
a[, paste0("a", v) := NA]
# now a has 200 columns, compared to the three we initialized it with
dim(a)
# [1] 2 200
I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
