how to make a text parsing function efficient in R


I have a function that calculates the consonance score of a book. First I import the phonetics dictionary from CMU, which forms a data frame of about 134,000 rows and 33 columns. Each row of the CMU dictionary has the form CLOUDS K L AW1 D Z: the first column holds the word and the remaining columns hold its phonemes. I then parse a book into a vector containing all of its words; the longest book so far has 218,711 words. Each word's phonemes are compared with those of the next word and of the word after that, and the TRUE matches are combined into a sum. The function I have is this:
getConsonanceScore <- function(book, consonanceScore, CMUdict) {
for (i in 1:((length(book)) - 2)) {
index1 <- replaceIfEmpty(which (toupper(book[i]) == CMUdict[,1]))
index2 <- replaceIfEmpty(which (toupper(book[i + 1]) == CMUdict[,1]))
index3 <- replaceIfEmpty(which (toupper(book[i + 2]) == CMUdict[,1]))
word1 <- as.character(CMUdict[index1, which(CMUdict[index1,] != "")])
word2 <- as.character(CMUdict[index2, which(CMUdict[index2,] != "")])
word3 <- as.character(CMUdict[index3, which(CMUdict[index3,] != "")])
consonanceScore <- sum(word1 %in% word2)
consonanceScore <- consonanceScore + sum(word1 %in% word3)
consonanceScore <- consonanceScore / length(book)
}
return(consonanceScore)
}
A replaceIfEmpty function basically just returns the index for a dummy value (that has been declared in the last row of the dataframe) if there is no match found in the CMU dictionary for any word in the book. It goes like this:
replaceIfEmpty <- function(x) {
if (length(x) > 0)
{
return (x)
}
else
{
x = 133780
return(x)
}
}
The issue I am facing is that getConsonanceScore takes a very long time, so much so that I had to divide the book length by 1000 in the loop just to check that the function was working. I am new to R and would be grateful for help making this function faster. Are there any ways of doing this? (Later I have to call this function on possibly 50-100 books.) Thanks a lot!

I recently re-read your question, the comments and @wibeasley's answer, and realized that I had not understood everything correctly. Now it has become clearer, and I'll try to suggest something useful.
First of all, we need a small example to work with. I've made it from the dictionary in your link.
dictdf <- read.table(text =
"A AH0
CALLED K AO1 L D
DOG D AO1 G
DOGMA D AA1 G M AH0
HAVE HH AE1 V
I AY1",
header = F, col.names = paste0("V", 1:25), fill = T, stringsAsFactors = F )
# V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 A AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 CALLED K AO1 L D NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 DOG D AO1 G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 DOGMA D AA1 G M AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 HAVE HH AE1 V NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 I AY1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
bookdf <- data.frame(words = c("I", "have", "a", "dog", "called", "Dogma"))
# words
# 1 I
# 2 have
# 3 a
# 4 dog
# 5 called
# 6 Dogma
Here we read the dictionary with fill = T and set the number of columns in the data.frame manually through col.names. You may use 50, 100 or some other number of columns (but I don't think the dictionary contains words that long). We also make bookdf, the vector of words in the form of a data.frame.
Then let's merge the book and the dictionary together. I use the dplyr library mentioned by @wibeasley.
# for big data frames dplyr does merging fast
require("dplyr")
# make all letters uppercase
bookdf[,1] <- toupper(bookdf[,1])
# merge
bookphon <- left_join(bookdf, dictdf, by = c("words" = "V1"))
# words V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19 V20 V21 V22 V23 V24 V25
# 1 I AY1 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 2 HAVE HH AE1 V NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 3 A AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 4 DOG D AO1 G NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 5 CALLED K AO1 L D NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
# 6 DOGMA D AA1 G M AH0 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
And after that we scan rowwise for matching sounds in consecutive words. I arranged it with the help of sapply.
consonanceScore <-
sapply(1:(nrow(bookphon)-2),
conScore <- function(i_row)
{
word1 <- bookphon[i_row,][,-1]
word2 <- bookphon[i_row+1,][,-1]
word3 <- bookphon[i_row+2,][,-1]
word1 <- unlist( word1[which(!is.na(word1) & word1 != "")] )
word2 <- unlist( word2[which(!is.na(word2) & word2 != "")] )
word3 <- unlist( word3[which(!is.na(word3) & word3 != "")] )
sum(word1 %in% word2) + sum(word1 %in% word3)
})
[1] 0 0 0 4
There are no shared phonemes in the first three rows, but the 4th word 'dog' has 2 matching sounds with 'called' (D and AO1) and 2 matches with 'dogma' (D and G). The result is a numeric vector; you can sum() it, divide by nrow(bookdf), or whatever you need.
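Indexing single rows of a data.frame inside sapply can still be slow for a book of around 200,000 words. Below is a minimal sketch of one possible alternative, assuming the dictionary is first turned into a named list with one phoneme vector per word. The names dict_lines, phon and score_at are illustrative and not part of the original code:

```r
# Raw dictionary lines in the "WORD PH1 PH2 ..." form described in the question.
dict_lines <- c("A AH0", "CALLED K AO1 L D", "DOG D AO1 G",
                "DOGMA D AA1 G M AH0", "HAVE HH AE1 V", "I AY1")

# Split each line on whitespace: first token is the word, the rest are phonemes.
parts <- strsplit(dict_lines, " +")
phon  <- setNames(lapply(parts, `[`, -1), vapply(parts, `[`, "", 1))

book <- toupper(c("I", "have", "a", "dog", "called", "Dogma"))

# Look up every word once. Words missing from the dictionary become NULL,
# which behaves like an empty phoneme vector in %in%.
bookphon <- unname(phon[book])

n <- length(bookphon)
score_at <- function(i) {
  sum(bookphon[[i]] %in% bookphon[[i + 1]]) +
    sum(bookphon[[i]] %in% bookphon[[i + 2]])
}
consonance <- vapply(seq_len(max(n - 2, 0)), score_at, integer(1))
consonance
# [1] 0 0 0 4
```

Each word is looked up only once and the comparisons work on plain character vectors, so no data.frame subsetting happens inside the loop.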

Are you sure it's working correctly? Isn't that function returning consonanceScore just for the last three words of the book? If the loop's 3rd-to-last-line is
consonanceScore <- sum(word1 %in% word2)
, how is its value being recorded, or influencing later iterations of the loop?
There are several vectorization approaches that will increase your speed, but for something tricky like this, I like making sure the slow loopy way is working correctly first. While you're in that stage of development, here are some suggestions how to make the code quicker and/or neater (which hopefully helps you debug with more clarity).
Short-term suggestions
Inside replaceIfEmpty(), use ifelse(). Maybe even use ifelse() directly inside the main function.
Why is as.character() necessary? That casting can be expensive. Are those columns factors? If so, use , stringsAsFactors=F when you use something like read.csv().
Don't use toupper() three times for each iteration. Just convert the whole thing once before the loop starts.
Similarly, don't execute / length(book) for each iteration. Since it's the same denominator for the whole book, divide the final vector of numerators only once (after the loop's done).
Long-term suggestions
Eventually I think you'll want to look up each word only once, instead of three times. Those lookups are expensive. Similar to @inscaven's suggestion, I think an intermediate table makes sense (where each row is a book's word).
To produce the intermediate table, you should get much better performance from a join function written and optimized by someone else in C/C++. Consider something like dplyr::left_join(). Maybe book has to be converted to a single-variable data.frame first. Then left join it to the first column of the dictionary. The row's subsequent columns will essentially be appended to the right side of book (which I think is what's happening now).
Once each iteration is quicker and correct, consider using one of the *apply functions, or something in dplyr. The advantage of these functions is that memory for the entire vector isn't destroyed and reallocated for every single word in each book.
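The short-term suggestions above can be sketched as follows. This is only an illustration on a toy dictionary, assuming the intended score is the total number of matches divided by the number of words in the book. The helper get_phon and the column names of the toy CMUdict are hypothetical:

```r
# Toy dictionary in the same shape as the question's CMUdict:
# first column is the word, remaining columns are phonemes ("" when unused).
CMUdict <- data.frame(
  word = c("A", "CALLED", "DOG", "DOGMA", "HAVE", "I"),
  p1   = c("AH0", "K", "D", "D", "HH", "AY1"),
  p2   = c("", "AO1", "AO1", "AA1", "AE1", ""),
  p3   = c("", "L", "G", "G", "V", ""),
  p4   = c("", "D", "", "M", "", ""),
  p5   = c("", "", "", "AH0", "", ""),
  stringsAsFactors = FALSE
)
book <- c("I", "have", "a", "dog", "called", "Dogma")

getConsonanceScore <- function(book, CMUdict) {
  # Convert case and look up every word once, before the loop.
  idx <- match(toupper(book), CMUdict[[1]])  # NA when a word is not found
  phonemes <- as.matrix(CMUdict[, -1])
  get_phon <- function(k) {
    if (is.na(k)) return(character(0))       # unknown word: no phonemes
    p <- phonemes[k, ]
    p[p != ""]
  }
  n <- length(book)
  if (n < 3) return(0)
  matches <- numeric(n - 2)                  # preallocated, one slot per word
  for (i in seq_len(n - 2)) {
    w1 <- get_phon(idx[i])
    matches[i] <- sum(w1 %in% get_phon(idx[i + 1])) +
      sum(w1 %in% get_phon(idx[i + 2]))
  }
  sum(matches) / n                           # divide once, after the loop
}

getConsonanceScore(book, CMUdict)
```

Unlike the original loop, the per-word counts are stored in matches instead of being overwritten, so every iteration contributes to the final score.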

Related

How to get name row as variable in function and plot density graph

I have issues with my function, and I don't know whether the problem is in the function itself or in the way I call it.
I have a big data frame with more than 20000 rows and around 700 columns, where each row is a part of a gene. I want to calculate the density for each row and plot the density with the name of the gene.
baseM <- read.csv("expansions_full_omim_06_07_21.2.csv", sep = "\t")
rownames(baseM) <- paste(baseM$motif, baseM$chromosome, baseM$intervalle , baseM$gene , baseM$localisation, baseM$OMIM, sep = ".")
baseM.num <- baseM[sapply(baseM, is.numeric)]
names <- rownames(baseM.num.fltr)
d.density <- function(X, n){
#print(X)
d <- density(as.numeric(as.matrix(X)), na.rm=T)
peaks <- NULL
for (i in 2:(length(d$y)-1)) {
if (d$y[i-1] >= d$y[i] & d$y[i] <= d$y[i+1]) {
peaks <- cbind(peaks, c(d$x[i], d$y[i]))
}}
df <- data.frame(test =as.numeric(as.matrix(X)))
g <- ggplot(df, aes(x = as.numeric(as.matrix(test)))) +
geom_density(fill="#69b3a2", color="#e9ecef", alpha=0.8)
ggsave(filename=paste("/work/gad/shared/analyse/STR/Marine/analysis/output/annotation/R_plots/", n, ".png", sep=""), plot=g)
#q <- plot(d)
#png(file=file_name)
#print(q)
#dev.off()
return(peaks)
}
baseM.num.fltr$peaks <- apply(temp, 1 , d.density, n=names)
I get my peaks correctly, but obviously something is wrong with the plot. I'm not sure the way I pass the name is correct, or would something else be better or easier? Thanks for your help! I tried two ways of plotting, with and without ggplot2, but neither works.
This is the error I get:
NULL
Error: `device` must be NULL, a string or a function.
>
Example of my data :
> head(baseM)
motif chromosome intervalle gene localisation
1 AAAAAAAAAAAAAAAAAAAC chr2 (69131154, 69132154) BMP10 intergenic
2 AAAAAAAAAAAAAAAAAAAC chr2 (237411093, 237412093) IQCA1 intronic
3 AAAAAAAAAAAAAAAAAAAC chr2 (44378070, 44379070) LRPPRC intergenic
4 AAAAAAAAAAAAAAAAAAAC chr2 (105218444, 105219444) LINC01102 intergenic
5 AAAAAAAAAAAAAAAAAAAC chr2 (124310903, 124311903) LINC01826 intergenic
6 AAAAAAAAAAAAAAAAAAAC chr2 (30730559, 30731559) LCLAT1 intronic
OMIM
1 .
2 .
3 .,Mitochondrial complex IV deficiency, nuclear type 5, (French-Canadian), 220111 (3)
4 .
5 .
6 .
dijen003 dijen004 dijen005 dijen006 dijen007 dijen008 dijen009 dijen010
1 NA NA NA NA NA NA NA NA
2 7 NA NA NA NA NA NA NA
3 NA NA NA 5 NA NA NA NA
4 NA NA NA 5 NA NA NA NA
5 NA NA NA 5 NA NA NA NA
6 NA NA NA NA 5 NA NA NA
dijen011 dijen012 dijen013 dijen014 dijen015 dijen016 dijen017 dijen018
1 NA NA NA NA NA NA NA NA
2 NA NA NA NA NA NA NA NA
3 NA NA NA NA NA NA NA NA
4 NA NA NA NA NA NA NA NA
5 NA NA NA NA NA NA NA NA
6 NA NA NA NA NA NA NA NA
(Sorry, I know it's a short example, but the data is really big, and of course not all lines have that many NAs.)

For the device argument, use png or 'png'. (Note that png() will work also but only when the filename has the '.png' extension.)
Example:
library(tidyverse)
set.seed(1L)
df <- tibble(a = rnorm(10))
df %>% ggplot(aes(a)) + geom_density()
ggsave("foo.png", device = "png")

Searching pairs in matrix in R

I am rather new to R, so I would be grateful if anyone could help me :)
I have large matrices, for example:
[image: matrix]
and a vector of genes.
My task is to search the matrix row by row, pair each gene that has a mutation (D707H in the matrix above) with the rest of the genes contained in the vector, and add the pairs to a new matrix. I tried to do this with loops but I have no idea how to write it correctly. For this matrix it should look something like this:
PR.02.1431
NBN BRCA1
NBN BRCA2
NBN CHEK2
NBN ELAC2
NBN MSR1
NBN PARP1
NBN RNASEL
Now I have something like this:
[image: my idea]
"a" is my initial matrix.
Can anyone point me in the right direction? :)

Perhaps what you want/need is which(..., arr.ind = TRUE).
Some sample data, for demonstration:
set.seed(2)
n <- 10
mtx <- array(NA, dim = c(n, n))
dimnames(mtx) <- list(letters[1:n], LETTERS[1:n])
mtx[sample(n*n, size = 4)] <- paste0("x", 1:4)
mtx
# A B C D E F G H I J
# a NA NA NA NA NA NA NA NA NA NA
# b NA NA NA NA NA NA NA NA NA NA
# c NA NA NA NA NA NA NA NA NA NA
# d NA NA NA NA NA NA NA NA NA NA
# e NA NA NA NA NA NA NA NA NA NA
# f NA NA NA NA NA NA NA NA NA NA
# g NA "x4" NA NA NA "x3" NA NA NA NA
# h NA NA NA NA NA NA NA NA NA NA
# i NA "x1" NA NA NA NA NA NA NA NA
# j NA NA NA NA NA NA "x2" NA NA NA
In your case, it appears that you want anything that is not an NA or NaN. You might try:
which(! is.na(mtx) & ! is.nan(mtx))
# [1] 17 19 57 70
but that isn't always intuitive when retrieving the row/column pairs (genes, I think?). Try instead:
ind <- which(! is.na(mtx) & ! is.nan(mtx), arr.ind = TRUE)
ind
# row col
# g 7 2
# i 9 2
# g 7 6
# j 10 7
How to use this: the integers are row and column indices, respectively. Assuming your matrix is using row names and column names, you can retrieve the row names with:
rownames(mtx)[ ind[,"row"] ]
# [1] "g" "i" "g" "j"
(An astute reader might suggest I use rownames(ind) instead. It certainly works!) Similarly for the colnames and "col".
Interestingly enough, even though ind is a matrix itself, you can subset mtx fairly easily with:
mtx[ind]
# [1] "x4" "x1" "x3" "x2"
Combining all three together, you might be able to use:
data.frame(
gene1 = rownames(mtx)[ ind[,"row"] ],
gene2 = colnames(mtx)[ ind[,"col"] ],
val = mtx[ind]
)
# gene1 gene2 val
# 1 g B x4
# 2 i B x1
# 3 g F x3
# 4 j G x2

I know where my mistake was, and now I have the matrix. Your code works well, but that's not exactly what I want to do.
a, b, c, d etc. are organisms and row names are genes (A, B, C, D etc.). I have to combine pairs of genes where one of them (in the same column) has something other than an NA value. For example, if gene A has value=4 in column a, I need:
gene1 gene2
a A B
a A C
a A D
a A E
I tried it this way, but the number of elements does not match and I do not know how to solve this.
ind= which(! is.na(a) & ! is.nan(a), arr.ind = TRUE)
ind1=which(macierz==1,arr.ind = TRUE)
ramka= data.frame(
kolumna = rownames(a)[ ind[,"row"] ],
gene1 = colnames(a)[ ind[,"col"] ],
gene2 = colnames(a)[ind1[,"col"]],
#val = macierz[ind]
)
Do you know how to do this in R?
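A minimal sketch of one way to build such pairs, assuming organisms are the rows and genes are the columns of the matrix (as in the attempt above) and that every non-NA cell marks a mutated gene. The toy matrix and the column names organism, gene1 and gene2 are illustrative only:

```r
# Hypothetical data: rows are organisms, columns are genes,
# a non-NA cell marks a mutation.
a <- matrix(NA_character_, nrow = 2, ncol = 4,
            dimnames = list(c("a", "b"), c("A", "B", "C", "D")))
a["a", "A"] <- "D707H"
a["b", "C"] <- "x1"

ind   <- which(!is.na(a), arr.ind = TRUE)  # one row per mutated cell
genes <- colnames(a)

# For every mutated cell, pair the mutated gene with all the other genes.
pairs <- do.call(rbind, lapply(seq_len(nrow(ind)), function(k) {
  mutated <- genes[ind[k, "col"]]
  data.frame(
    organism = rownames(a)[ind[k, "row"]],
    gene1    = mutated,
    gene2    = setdiff(genes, mutated),
    stringsAsFactors = FALSE
  )
}))
pairs
```

The number of elements matches because each mutated cell produces its own small data.frame, in which the organism and gene1 values are recycled against the vector of remaining genes.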

Opening csv of specific sequences: NAs come out of nowhere?

I feel like this is a relatively straightforward question, and I feel I'm close but I'm not passing edge-case testing. I have a directory of CSVs and instead of reading all of them, I only want some of them. The files are in a format like 001.csv, 002.csv,...,099.csv, 100.csv, 101.csv, etc which should help to explain my if() logic in the loop. For example, to get all files, I'd do something like:
id = 1:1000
setwd("D:/")
filenames = as.character(NULL)
for (i in id){
if(i < 10){
i <- paste("00",i,sep="")
}
else if(i < 100){
i <- paste("0",i,sep="")
}
filenames[[i]] <- paste(i,".csv", sep="")
}
y <- do.call("rbind", lapply(filenames, read.csv, header = TRUE))
The above code works fine for id=1:1000, for id=1:10, id=20:70 but as soon as I pass it id=99:100 or any sequence involving numbers starting at over 100, it introduces a lot of NAs.
Example output below for id=98:99
> filenames
098 099
"098.csv" "099.csv"
Example output below for id=99:100
> filenames
099
"099.csv" NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
NA NA NA NA NA NA NA NA NA
"100.csv"
I feel like I'm missing some catch statement in my if() logic. Any insight would be greatly appreciated! :)

You can avoid the loop for creating the filenames
filenames <- sprintf('%03d.csv', 1:1000)
y <- do.call(rbind, lapply(filenames, read.csv, header = TRUE))

@akrun has given you a much better way of solving your task. But in terms of the actual issue with your code, the problem is that for i < 100 you subset by a character vector (implicitly converted using paste) while for i >= 100 you subset by an integer. When you use id = 99:100 this translates to:
filenames <- character(0)
filenames["099"] <- "099.csv" # length(filenames) == 1L
filenames[100] <- "100.csv" # length(filenames) == 100L, with all(is.na(filenames[2:99]))
Assigning to a named member of a vector that doesn't yet exist will create a new member at position length(vector) + 1 whereas assigning to a numbered position that is > length(vector) will also fill in every intervening position with NA.
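The difference between the two kinds of assignment can be checked directly. The snippet below is a small sketch: the first part reproduces the gap, and the second part shows one way to avoid it by always indexing with a plain position (the names gappy and k are illustrative):

```r
gappy <- character(0)
gappy["099"] <- "099.csv"   # assigned by name: appended at the end
length(gappy)
# [1] 1
gappy[100] <- "100.csv"     # assigned by position: vector is padded with NA
length(gappy)
# [1] 100
sum(is.na(gappy))
# [1] 98

# Indexing by position within id (not by the id value) leaves no gaps.
id <- 99:100
filenames <- character(length(id))
for (k in seq_along(id)) {
  filenames[k] <- sprintf("%03d.csv", id[k])
}
filenames
# [1] "099.csv" "100.csv"
```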

Another approach, although less efficient than @akrun's solution, is with the following function:
merged <- function(id = 1:332) {
df <- data.frame()
for(i in 1:length(id)){
add <- read.csv(sprintf('%03d.csv', id[i]))
df <- rbind(df,add)
}
df
}
Now, you can merge the files with:
dat <- merged(99:100)
Furthermore, you can assign column names by inserting the following line in the function just before the last line with df:
colnames(df) <- c(..specify the colnames in here..)
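Growing df with rbind() inside the loop copies the accumulated data on every iteration. A minimal sketch of a variant that reads all files into a list and binds them once, using throwaway files in a temporary directory so that it can be run anywhere (the function name read_subset and the demo columns are hypothetical):

```r
# Write three small CSV files into a temporary directory (demo data only).
dir <- tempdir()
for (k in 99:101) {
  write.csv(data.frame(id = k, value = k * 2),
            file.path(dir, sprintf("%03d.csv", k)), row.names = FALSE)
}

# Read only the requested ids and bind the pieces together in one step.
read_subset <- function(id, directory) {
  paths <- file.path(directory, sprintf("%03d.csv", id))
  do.call(rbind, lapply(paths, read.csv))
}

dat <- read_subset(99:100, dir)
dat
```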

how to extract numbers that are in certain position in the character vector within a data frame

I have a csv file looking like this:
data[1,]"0;0;0;0";"0;0;0;0";"1395,387994;0;0;0";"1438,433382;0;0;0";"1477,891654;0;0;0";NA;NA;NA;NA
data[2,]"0;0;0;0";"1129,941435;0;0;0";"1140,702782;0;0;0";"1140,702782;0;0;0";"2415,922401;0;0;0";"2469,729136;0;0;0";"2545,058565;0;0;0";NA;NA
data[3,]"0;0;0;0";"0;0;0;0";"0;0;0;0";"0;0;0;0";"1506,58858;0;0;0";"1506,58858;0;0;0";"1517,349927;0;0;0";"1528,111274;0;0;0";NA
Basically it's a 238 by 581 data frame. What I want is to keep the NAs as NAs, convert the "0;0;0;0" entries into NAs, and get the first number from entries where there is a non-zero value in the first position, like "1506,58858;0;0;0".
result should look like this:
data[1,] NA NA 1395,387994 1438,433382 1477,891654 NA NA NA NA
data[2,] NA 1129,941435 1140,702782 1140,702782 2415,922401 2469,729136 2545,058565 NA NA
data[3,] NA NA NA NA 1506,58858 1506,58858 1517,349927 1528,111274 NA
I read my data like this:
f0=read.table("D:../f0.per.call.csv",sep=";",na.strings =c("NA","0;0;0;0"),stringsAsFactors = FALSE)
I know it is a very easy task but I can't figure it out; I keep getting errors when I try to convert the characters to numerical values. Any help will be appreciated,
thanks.

I would do it in 2 steps, after I read the file:
replace "0;0;0;0" by NA
use regular expression to remove "0;0;0;" at the end of some columns
Here is the code I used to replace the "0;0;0":
dat <- read.table("D:../f0.per.call.csv",
sep=";",na.strings =c("NA"),stringsAsFactors = FALSE)
dat[dat=="0;0;0;0"] <- NA
sapply(dat,function(x)gsub("(.*);0;0;0","\\1",x))
V1 V2 V3 V4 V5 V6 V7 V8 V9
[1,] NA NA "1395,387994" "1438,433382" "1477,891654" NA NA NA NA
[2,] NA "1129,941435" "1140,702782" "1140,702782" "2415,922401" "2469,729136" "2545,058565" NA NA
[3,] NA NA NA NA "1506,58858" "1506,58858" "1517,349927" "1528,111274" NA

After reading in your data, you can use strsplit and extract just the first item using lapply/sapply/vapply. Here's an example:
f0 <- read.table("D:../f0.per.call.csv", sep=";",
na.strings = c("NA","0;0;0;0"),
stringsAsFactors = FALSE)
f0[] <- lapply(f0, function(y)
vapply(strsplit(as.character(y), ";"),
function(z) z[[1]], ""))
f0
# V1 V2 V3 V4 V5 V6 V7 V8 V9
# 1 <NA> <NA> 1395,387994 1438,433382 1477,891654 <NA> <NA> <NA> <NA>
# 2 <NA> 1129,941435 1140,702782 1140,702782 2415,922401 2469,729136 2545,058565 <NA> <NA>
# 3 <NA> <NA> <NA> <NA> 1506,58858 1506,58858 1517,349927 1528,111274 <NA>
The result here is a data.frame, just like the input was a data.frame.
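Both answers leave the values as strings with a decimal comma, while the question asks for numbers. A minimal sketch of the remaining conversion, assuming the comma is always a decimal separator. The small f0 below is a hypothetical stand-in for the cleaned data, and to_number is an illustrative helper:

```r
# Hypothetical stand-in for the cleaned data frame of strings.
f0 <- data.frame(
  V1 = c(NA, NA, NA),
  V2 = c(NA, "1129,941435", NA),
  V3 = c("1395,387994", "1140,702782", NA),
  stringsAsFactors = FALSE
)

# Replace the decimal comma with a point, then convert; NA stays NA.
to_number <- function(x) {
  as.numeric(sub(",", ".", as.character(x), fixed = TRUE))
}

f0[] <- lapply(f0, to_number)
str(f0)
```

Assigning to f0[] keeps the result a data.frame with the original column names.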

R Loop Script to Create Many, Many Variables

I want to create a lot of variables across several separate dataframes which I will then combine into one grand data frame.
Each sheet is labeled by a letter (there are 24) and each sheet contributes somewhere between 100-200 variables. I could write it as such:
a$variable1 <- NA
a$variable2 <- NA
.
.
.
w$variable25 <- NA
This can/will get ugly, and I'd like to write a loop or use a vector to do the work. I'm having a heck of a time doing it though.
I essentially need a script which will allow me to specify a form and then just tack numbers onto it.
So,
a$variable[i] <- NA
where [i] gets tacked onto the actual variable created.

I just learnt this neat little trick from @eddi
#created some random dataset with 3 columns
library(data.table)
a <- data.table(
a1 = c(1,5),
a2 = c(2,1),
a3 = c(3,4)
)
#assuming that you now need to add more columns from a4 to a200
# first, creating the sequence from 4 to 200
v = c(4:200)
# then using that sequence to add the 197 more columns
a[, paste0("a", v) :=
NA]
# now a has 200 columns, as compared to the three we initiated it with
dim(a)
#[1] 2 200

I don't think you actually need this, although you seem to think so for some reason.
Maybe something like this:
a <- as.data.frame(matrix(NA, ncol=10, nrow=5))
names(a) <- paste0("Variable", 1:10)
print(a)
# Variable1 Variable2 Variable3 Variable4 Variable5 Variable6 Variable7 Variable8 Variable9 Variable10
# 1 NA NA NA NA NA NA NA NA NA NA
# 2 NA NA NA NA NA NA NA NA NA NA
# 3 NA NA NA NA NA NA NA NA NA NA
# 4 NA NA NA NA NA NA NA NA NA NA
# 5 NA NA NA NA NA NA NA NA NA NA
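If the data frames already exist, the same idea works in base R by assigning NA to a vector of new column names. A minimal sketch, assuming the sheets are kept in a named list; the names sheets and counts and the column name pattern are illustrative:

```r
# One existing data frame: add variable1 ... variable25 in a single step.
a <- data.frame(id = 1:3)
a[paste0("variable", 1:25)] <- NA
dim(a)
# [1]  3 26

# Several sheets, each with its own number of new variables.
sheets <- list(a = data.frame(id = 1:3), w = data.frame(id = 1:3))
counts <- c(a = 2, w = 3)
for (s in names(sheets)) {
  new_cols <- paste0(s, "_variable", seq_len(counts[[s]]))
  sheets[[s]][new_cols] <- NA
}
names(sheets$w)
# [1] "id"          "w_variable1" "w_variable2" "w_variable3"
```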

If you want variables with different types:
p <- 10 # number of variables
N <- 100 # number of records
vn <- vector(mode="list", length=p)
names(vn) <- paste0("V", seq(p))
vn[1:8] <- NA_real_ # numeric
vn[9:10] <- NA_character_ # character
df <- as.data.frame(lapply(vn, function(x, n) rep(x, n), n=N))
