Error in simpleLoess: NA/NaN/Inf in foreign function call

I am trying to use normalize.loess() through lumiN() from the lumi package.
At the 38th iteration, the loess() function fails with
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
NA/NaN/Inf in foreign function call (arg 1)
I have searched, and it may be related to a missing argument. However, I checked with debug(loess) and all arguments are defined.
I cannot post the data because it is very large (13237x566) and confidential, but I found the following:
a minimal example works (a random 20x5 matrix)
normalization fails somewhere between columns 1 and 38
the same normalization using only those columns completes successfully
it is not a memory issue
the matrix has no NA values
What am I missing?
Thanks
Code
raw_matrix <- lumiR('example.txt')
norm_matrix <- lumiN(raw_matrix, method='loess')
Perform loess normalization ...
Done with 1 vs 2 in iteration 1
[... identical lines for 1 vs 3 through 1 vs 37 ...]
Done with 1 vs 38 in iteration 1
Error in simpleLoess(y, x, w, span, degree, parametric, drop.square, normalize, :
NA/NaN/Inf in foreign function call (arg 1)
Environment
My sessionInfo() is
> sessionInfo()
R version 3.0.2 (2013-09-25)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] affy_1.38.1 lumi_2.12.0 Biobase_2.20.0
[4] BiocGenerics_0.6.0 BiocInstaller_1.10.2
loaded via a namespace (and not attached):
[1] affyio_1.28.0 annotate_1.38.0 AnnotationDbi_1.22.6
[4] beanplot_1.1 Biostrings_2.28.0 colorspace_1.2-4
[7] DBI_0.2-7 GenomicRanges_1.12.5 grid_3.0.2
[10] illuminaio_0.2.0 IRanges_1.18.1 KernSmooth_2.23-10
[13] lattice_0.20-24 limma_3.16.8 MASS_7.3-29
[16] Matrix_1.0-14 matrixStats_0.8.12 mclust_4.2
[19] methylumi_2.6.1 mgcv_1.7-27 minfi_1.6.0
[22] multtest_2.16.0 nleqslv_2.0 nlme_3.1-111
[25] nor1mix_1.1-4 preprocessCore_1.22.0 RColorBrewer_1.0-5
[28] reshape_0.8.4 R.methodsS3_1.5.2 RSQLite_0.11.4
[31] siggenes_1.34.0 splines_3.0.2 stats4_3.0.2
[34] survival_2.37-4 tcltk_3.0.2 tools_3.0.2
[37] XML_3.98-1.1 xtable_1.7-1 zlibbioc_1.6.0

I eventually figured out what was going wrong:
I was trying to normalize a matrix that was already log2-transformed. As far as I know, normalize.loess by default log-transforms its input, so the data were going to be log-transformed twice.
This was a problem because some values in the input matrix were equal to 1, so:
log2(log2(1)) = log2(0) = -Inf
which is clearly not an allowed value during normalization.
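The pitfall can be reproduced in a couple of lines (a toy matrix, not the real data; affy's normalize.loess, which lumiN uses for the loess method, exposes a log.it argument for exactly this situation):

```r
# Toy illustration of the double-log pitfall:
m <- matrix(c(1, 2, 4, 8, 16, 32), nrow = 3)  # raw intensities; note the 1
logged <- log2(m)                             # log2(1) == 0
log2(logged[1, 1])                            # log2(0) == -Inf, which breaks loess
# Fix: pass the raw matrix, or disable the internal log transform,
# e.g. affy::normalize.loess(logged, log.it = FALSE)
```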
Hope this helps someone.


How to subtract R dataframe columns based on information in other dataframes?

I have a dataframe that I'd like to add new columns to, where the calculation depends on values in another dataframe which holds instructions.
I have created a reproducible example below (although in reality there are quite a few more columns).
input dataframes:
base <- data.frame("A"=c("orange","apple","banana"),
"B"=c(5,3,6),
"C"=c(7,12,4),
"D"=c(5,2,7),
"E"=c(1,18,4))
key <- data.frame("cols"=c("A","B","C","D","E"),
"include"=c("no","no","yes","no","yes"),
"subtract"=c("na","A","B","C","D"),
"names"=c("na","G","H","I","J"))
desired output dataframe:
output <- data.frame("A"=c("orange","apple","banana"),
"B"=c(5,3,6),
"C"=c(7,12,4),
"D"=c(5,2,7),
"E"=c(1,18,4),
"H"=c(2,9,-2),
"J"=c(-4,16,-3))
The key dataframe has a row for each column in the base dataframe, and an "include" column that must be set to "yes" for any calculation to be done. If it is set to "yes", I want to add a new column, with the name given in "names", that subtracts the column given in "subtract".
For example, column "C" in the base dataframe is marked for inclusion, so I want to create a new column called "H" that holds the values of column "C" minus the values of column "B".
I thought I could do this with a loop, but my attempts have not been successful and my searches have not found anything that helped (I'm a bit new). Any help would be much appreciated.
sessionInfo():
R version 3.4.2 (2017-09-28)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 18363)
Matrix products: default
locale:
[1] LC_COLLATE=English_United Kingdom.1252 LC_CTYPE=English_United Kingdom.1252
[3] LC_MONETARY=English_United Kingdom.1252 LC_NUMERIC=C
[5] LC_TIME=English_United Kingdom.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
Here is a base R option
k <- subset(key, include == "yes")
output <- cbind(base, setNames(base[k[["cols"]]] - base[k[["subtract"]]], k$names))
and we will get
> output
A B C D E H J
1 orange 5 7 5 1 2 -4
2 apple 3 12 2 18 9 16
3 banana 6 4 7 4 -2 -3
Does the following work for you?
output <- base
for (i in which(key[["include"]] == "yes")) {
  key.row <- key[i, ]
  output[[key.row[["names"]]]] <- base[[key.row[["cols"]]]] - base[[key.row[["subtract"]]]]
}
Result:
> output
A B C D E H J
1 orange 5 7 5 1 2 -4
2 apple 3 12 2 18 9 16
3 banana 6 4 7 4 -2 -3
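For completeness, a third sketch using Map over the key rows (the same toy frames as in the question, with stringsAsFactors = FALSE so the key columns are plain character vectors; Map avoids the explicit loop):

```r
base <- data.frame(A = c("orange", "apple", "banana"),
                   B = c(5, 3, 6), C = c(7, 12, 4),
                   D = c(5, 2, 7), E = c(1, 18, 4),
                   stringsAsFactors = FALSE)
key <- data.frame(cols     = c("A", "B", "C", "D", "E"),
                  include  = c("no", "no", "yes", "no", "yes"),
                  subtract = c("na", "A", "B", "C", "D"),
                  names    = c("na", "G", "H", "I", "J"),
                  stringsAsFactors = FALSE)
k <- key[key$include == "yes", ]
# one subtraction per included key row
new_cols <- Map(function(co, su) base[[co]] - base[[su]], k$cols, k$subtract)
output <- cbind(base, setNames(as.data.frame(new_cols), k$names))
```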

R: Natural Language Processing on Support Vector Machine-TermDocumentMatrix

I have started working on a project which requires Natural Language Processing and building a model on Support Vector Machine (SVM) in R.
I’d like to generate a Term Document Matrix with all the tokens.
Example:
testset <- c("From month 2 the AST and total bilirubine were not measured.", "16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.", "M6 is 13 days out of the visit window")
word_ann <- Maxent_Word_Token_Annotator()
sent_ann <- Maxent_Sent_Token_Annotator()
test_annotations <- annotate(testset, list(sent_ann, word_ann))
test_doc <- AnnotatedPlainTextDocument(testset, test_annotations)
sents(test_doc)
[[1]]
[1] "From" "month" "2" "the" "AST" "and" "total"
[8] "bilirubine" "were" "not" "measured" "."
[[2]]
[1] "16:OTHER" "-"
[3] "COMMENT" "REQUIRED"
[5] "IN" "COMMENT"
[7] "COLUMN;07/02/2004/GENOTYPING;SF-" "genotyping"
[9] "consent" "not"
[11] "offered" "until"
[13] "T4" "."
[[3]]
[1] "M6" "is" "13" "days" "out" "of" "the" "visit" "window"
And then I generated a TDM:
tdm <- TermDocumentMatrix(as.VCorpus(list(test_doc)))
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 1)>>
Non-/sparse entries: 22/0
Sparsity : 0%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms NULL
16:other 1
and 1
ast 1
bilirubine 1
column;07/02/2004/genotyping;sf- 1
comment 2
consent 1
days 1
from 1
genotyping 1
measured 1
month 1
not 2
offered 1
out 1
required 1
the 2
total 1
until 1
visit 1
were 1
window 1
I actually have three documents in the dataset:
"From month 2 the AST and total bilirubine were not measured.",
"16:OTHER - COMMENT REQUIRED IN COMMENT COLUMN;07/02/2004/GENOTYPING;SF- genotyping consent not offered until T4.",
"M6 is 13 days out of the visit window"
so the TDM should have three document columns, but only one column is shown here.
Could anyone please give me some advice on this?
sessionInfo()
R version 3.3.0 (2016-05-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tm_0.6-2 openxlsx_3.0.0 magrittr_1.5 RWeka_0.4-28 openNLP_0.2-6 NLP_0.1-9
[7] rJava_0.9-8
I think what you are trying to do is take a list of three strings and turn it into a corpus. I am not sure that three different strings in one list count as three different documents.
I took your data, put it into three .txt files, and ran this:
text_name <- file.path("C:", "texts")
dir(text_name)
[1] "text1.txt" "text2.txt" "text3.txt"
If you don't want to do any cleaning, you can convert it directly to a corpus with
docs <- Corpus(DirSource(text_name))
summary(docs)
Length Class Mode
text1.txt 2 PlainTextDocument list
text2.txt 2 PlainTextDocument list
text3.txt 2 PlainTextDocument list
dtm <- DocumentTermMatrix(docs)
dtm
<<DocumentTermMatrix (documents: 3, terms: 22)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
tdm <- TermDocumentMatrix(docs)
tdm
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
inspect(tdm)
<<TermDocumentMatrix (terms: 22, documents: 3)>>
Non-/sparse entries: 24/42
Sparsity : 64%
Maximal term length: 32
Weighting : term frequency (tf)
Docs
Terms text1.txt text2.txt text3.txt
16:other 0 1 0
and 1 0 0
ast 1 0 0
bilirubine 1 0 0
column;07/02/2004/genotyping;sf- 0 1 0
comment 0 2 0
consent 0 1 0
days 0 0 1
from 1 0 0
genotyping 0 1 0
measured. 1 0 0
month 1 0 0
not 1 1 0
offered 0 1 0
out 0 0 1
required 0 1 0
the 1 0 1
total 1 0 0
until 0 1 0
visit 0 0 1
were 1 0 0
window 0 0 1
I think you might want to create three different documents and then convert them into a corpus. Let me know if this helps.
Alternatively, considering you want each element of your text vector treated as a document:
convert the vector to a dataframe, then build the corpus from it
df <- data.frame(testset)
install.packages("tm")
library(tm)
docs <- Corpus(VectorSource(df$testset))
summary(docs)
Length Class Mode
1 2 PlainTextDocument list
2 2 PlainTextDocument list
3 2 PlainTextDocument list
Follow the steps in the previous answer after this to get your TDM. This should solve your problem.
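If you just want to see the mechanics without tm, a term-document matrix is essentially a contingency table of terms by documents; a minimal base-R sketch (with hypothetical toy documents standing in for the real text):

```r
# Tiny term-document matrix via table(), one column per document
docs <- c("comment required in comment column",
          "the visit window")
toks <- lapply(docs, function(d) tolower(strsplit(d, "\\s+")[[1]]))
m <- data.frame(doc  = rep(seq_along(toks), lengths(toks)),
                term = unlist(toks))
tdm <- table(m$term, m$doc)  # rows = terms, columns = document indices
```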

Selecting a subset of IDs

I have a data.table dt with ids in the column idnum, and a data.table ids that contains a list of ids in the column idnum (all of which exist in dt).
I want to get:
the intersection: dt where dt.idnum is in ids.idnum
the complement of the intersection: dt where dt.idnum is not in ids.idnum
I got the first one with ease using
setkey(dt, idnum)
setkey(ids, idnum)
dt[ids]
However, I'm stuck on the second one. My approach was
dt[is.element(idnum, ids[, idnum]) == FALSE]
However, the row numbers of the two groups do not add up to nrow(dt), so I suspect the second command. What can I do instead, and where am I going wrong? Is there perhaps a more efficient way of computing the second group, given that it is the complement of the first group and I already have that one?
Update
I tried the approach given in the answer, but my numbers don't add up:
> nrow(x[J(ids$idnum)])
[1] 148
> nrow(x[!J(ids$idnum)])
[1] 52730
> nrow(x)
[1] 52863
However, the first two numbers add up to 52878; that is, I have 15 rows too many. My data contains duplicated idnum values; could that be the reason?
Here's some description of the data I used:
> str(x)
Classes 'data.table' and 'data.frame': 52863 obs. of 1 variable:
$ idnum: int 6 6 11 21 22 22 22 22 27 27 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "idnum"
> head(x)
idnum
1: 6
2: 6
3: 11
4: 21
5: 22
6: 22
> str(ids)
Classes 'data.table' and 'data.frame': 46 obs. of 1 variable:
$ idnum: int 2909 5012 5031 5033 5478 6289 6405 6519 7923 7940 ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "idnum"
> head(ids)
idnum
1: 2909
2: 5012
3: 5031
4: 5033
5: 5478
6: 6289
and here is
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] C/C/C/C/C/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
other attached packages:
[1] yaml_2.1.13 ggplot2_1.0.0 mFilter_0.1-3
[4] data.table_1.9.4 foreign_0.8-61
loaded via a namespace (and not attached):
[1] MASS_7.3-35 Rcpp_0.11.3 chron_2.3-45
[4] colorspace_1.2-4 digest_0.6.4 grid_3.1.1
[7] gtable_0.1.2 labeling_0.3 munsell_0.4.2
[10] plyr_1.8.1 proto_0.3-10 reshape2_1.4
[13] scales_0.2.4 stringr_0.6.2 tools_3.1.1
Here is one way:
library(data.table)
set.seed(1) # for reproducible example
dt <- data.table(idnum = 1:1e5, x = rnorm(1e5)) # 100,000 rows, unique ids
ids <- data.table(idnum=sample(1:1e5,10)) # 10 random ids
setkey(dt,idnum)
result.1 <- dt[J(ids$idnum)] # inclusive set (records with common ids)
result.2 <- dt[!J(ids$idnum)] # exclusive set (records from dt with ids$idnum excluded)
any(result.2$idnum %in% result.1$idnum)
# [1] FALSE
EDIT: Response to the OP's comment.
Comparing the numbers of rows is not meaningful. The join returns rows for all matches, so if a given idnum is present twice in dt and three times in ids, you will get 2 x 3 = 6 rows in the result. The important test is the one I did: are any of the ids in result.1 also present in result.2? If so, then something is wrong.
If you have duplicated ids$idnum, try:
result.1 <- dt[J(unique(ids$idnum))] # inclusive set (records with common ids)
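The multiplicity effect can be seen concretely with base merge(), which follows the same join semantics (toy ids here, not the real data):

```r
# idnum 1 appears twice in the main table and twice in the lookup,
# so the join produces 2 x 2 = 4 rows for it
dt_like  <- data.frame(idnum = c(1, 1, 2))
ids_like <- data.frame(idnum = c(1, 1))
joined <- merge(dt_like, ids_like)
nrow(joined)  # 4, more rows than nrow(dt_like)
```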

ff package in R: how to move data from one drive to another, and change filenames

I am working intensively with the amazing ff and ffbase package.
Due to some technical details, I have to work in my C: drive with my R session. After finishing that, I move the generated files to my P: drive (using cut/paste in windows, NOT using ff).
The problem is that when I load the ffdf object:
load.ffdf("data")
I get the error:
Error: file.access(filename, 0) == 0 is not TRUE
This is ok, because nobody told the ffdf object that it was moved, but trying :
filename(data$x) <- "path/data_ff/x.ff"
or
pattern(data) <- "./data_ff/"
does not help, giving the error:
Error in `filename<-.ff`(`*tmp*`, value = filename) :
ff file rename from 'C:/DATA/data_ff/id.ff' to 'P:/DATA_C/data_ff/e84282d4fb8.ff' failed.
Is there any way to change, inside the ffdf object, the path to the files' new location?
Thank you !!
If you want to 'correct' your filenames afterwards you can use:
physical(x)$filename <- "newfilename"
For example:
> a <- ff(1:20, vmode="integer", filename="./a.ff")
> saveRDS(a, "a.RDS")
> rm(a)
> file.rename("./a.ff", "./b.ff")
[1] TRUE
> b <- readRDS("a.RDS")
> b
ff (deleted) integer length=20 (20)
> physical(b)$filename <- "./b.ff"
> b[]
opening ff ./b.ff
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
Using filename() in the first session would of course have been easier. You could also have a look at the save.ffdf and corresponding load.ffdf functions in the ffbase package, which make this even simpler.
Addition
To rename the filenames of all columns in a ffdf you can use the following function:
redir <- function(ffdf, newdir) {
  for (x in physical(ffdf)) {
    fn <- basename(filename(x))
    physical(x)$filename <- file.path(newdir, fn)
  }
  return(ffdf)
}
You can also use ff::clone():
R> foo <- ff(1:20, vmode = "integer")
R> foo
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(foo)$filename
[1] "/vol/fftmp/ff69be3e90e728.ff"
R> bar <- clone(foo, pattern = "~/")
R> bar
ff (open) integer length=20 (20)
[1] [2] [3] [4] [5] [6] [7] [8] [13] [14] [15] [16] [17] [18] [19]
1 2 3 4 5 6 7 8 : 13 14 15 16 17 18 19
[20]
20
R> physical(bar)$filename
[1] "/home/ubuntu/69be5ec0cf98.ff"
From what I understand from briefly skimming the code of save.ffdf and load.ffdf, those functions do this for you when you save/load.

R: Creating n-grams in R with Asian / Chinese characters?

So I'm trying to create bigrams and trigrams of a given set of text, which just happens to be Chinese. At first glance, the tau package seems almost perfect for the application. Given the following set-up, I get close to what I want:
library(tau)
q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
textcnt(q,method="ngram",n=3L,decreasing=TRUE)
The only problem is that the output is in unicode character strings, not the characters themselves. So I get something like:
_ + < <U <U+ > U U+ 9 +5 5 U+5 >_ _< _<U +59 59 2 29 29> 592 7 92
22 19 19 19 19 19 19 19 17 14 14 14 11 11 11 9 9 8 8 8 8 8 8
929 9> >< ><U 9>_ E +5E 3 3> 3>_ 5E 5E7 6 73 73> A E7 E73 4 8 9>< A> +6
8 8 8 8 5 5 4 4 4 4 4 4 4 4 4 4 4 4 3 3 3 3 2
+7 4> 4>< 7A A>< C U+6 U+7 +4 +4E +5F +66 +6C +76 +7A 0 0A 0A> 1 14 14> 4E 4EC
2 2 2 2 2 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
597 5F 5F8 60 60A 66 660 68 684 6C 6C1 76 768 7A7 7A> 7D 7D> 84 84> 88 88> 8> 8><
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
97 97D A7 A7A A>_ C1 C14 CA CA> D D> D>_ EC ECA F F8 F88 U+4
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
I tried to write something that performs a similar function, but I can't wrap my head around the code for anything more than a monogram (apologies if the code is inefficient or ugly; I'm doing my best here). An advantage of this method is that I can also get word counts within individual "documents" by simply examining DTM, which is kind of nice.
data <- c(NA, NA, NA)
names(data) <- c("doc", "term", "freq")
terms <- NA
for (i in 1:length(q)) {
  temp <- data.frame(i, table(strsplit(q[i], "")))
  names(temp) <- c("doc", "term", "freq")
  data <- rbind(data, temp)
}
data <- data[-1,]
DTM <- xtabs(freq ~ doc + term, data)
colSums(DTM)
This actually gives a nice little output:
天 平 空 昊 今 好 很 气 的
8 4 1 1 1 1 1 1 1
Does anyone have any suggestions for using tau or altering my own code to achieve bigrams and trigrams for my Chinese characters?
Edit:
As requested in the comments, here is my sessionInfo() output:
R version 3.0.0 (2013-04-03)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] tau_0.0-15
loaded via a namespace (and not attached):
[1] tools_3.0.0
The stringdist package will do that for you:
> library(stringdist)
> q <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> v1 <- c("天","平","天","平","天","平","天","平","天空","昊天","今天的天气很好")
> t(qgrams(v1, q=1))
V1
天 8
平 4
空 1
昊 1
...
> v2 <- c("天气气","平","很好平","天空天空天空","昊天","今天的天天气很好")
> t(qgrams(v2, q=2))
V1
天气 2
气气 1
空天 2
天空 3
天的 1
天天 3
今天 1
...
The reason I transpose the returned matrices is that R renders them incorrectly with regard to the column width, which happens to be the length of the unicode-ID character string (e.g. "<U+6C14><U+6C14>").
In case you are interested in further details about the stringdist package - I recommend this text: http://www.joyofdata.de/blog/comparison-of-string-distance-algorithms ;)
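If you would rather avoid an extra dependency, the OP's per-character monogram loop also extends to bigrams in a few lines of base R (a sketch with stand-in ASCII strings; the same strsplit(s, "") splitting works for the Chinese vector q):

```r
# Character bigrams of a single string, hand-rolled
bigrams <- function(s) {
  chars <- strsplit(s, "")[[1]]
  if (length(chars) < 2) return(character(0))
  # pair each character with its successor
  paste0(chars[-length(chars)], chars[-1])
}
q <- c("abc", "ab", "a")
table(unlist(lapply(q, bigrams)))  # bigram counts across all strings
```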
