Is it possible to make a heatmap with 200k rows?

Is it possible to make a heatmap with 200k rows? I have a matrix with genomic coordinates as rows, and each column represents the presence or absence of that region in one patient, so I have 15 patients (columns). When I try heatmap.2 or pheatmap I get a memory allocation error. How can I use the entire matrix to generate a heatmap? The values of my matrix are just 0 and 1, and I want to draw some hypotheses based on the heatmap. I first tried on my laptop, which did not work, so I tried on our Linux cluster, which is quite powerful, but it still shows the error below. How can I resolve this problem? I have also added the sessionInfo() output. Any workaround is appreciated.
data<-read.delim("path/H3_marks_map.txt",sep="\t", row.names=1)
pdf(file="path/map/H3_maps1.pdf")
pheatmap(data,scale="none")
Error: cannot allocate vector of size 170.5 Gb
dev.off()
sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=C LC_NUMERIC=C LC_TIME=C.UTF-8
[4] LC_COLLATE=C.UTF-8 LC_MONETARY=C.UTF-8 LC_MESSAGES=C.UTF-8
[7] LC_PAPER=C.UTF-8 LC_NAME=C LC_ADDRESS=C
[10] LC_TELEPHONE=C LC_MEASUREMENT=C.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] RColorBrewer_1.1-2 pheatmap_1.0.7
loaded via a namespace (and not attached):
[1] colorspace_1.2-6 grid_3.1.2 gtable_0.1.2 munsell_0.4.2
[5] plyr_1.8.3 Rcpp_0.12.1 scales_0.3.0 tools_3.1.2
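A possible workaround, sketched under assumptions rather than taken from the post: pheatmap clusters rows by default, and clustering n rows needs a distance matrix of n(n-1)/2 doubles, which is about 149 GiB for 200,000 rows and so in line with the error above. Because the matrix holds only 0/1 values in 15 columns, there can be at most 2^15 = 32,768 distinct row patterns, so one option is to cluster the distinct patterns instead of every row. The simulated matrix `m` and the helper `dist_gib` below are hypothetical stand-ins for the real data, and the sketch uses base R only.

```r
# Hypothetical stand-in for the real matrix: regions x 15 patients, values 0/1.
# (Kept small here; the real data would have about 200,000 rows.)
set.seed(1)
n_rows <- 20000
n_cols <- 15
m <- matrix(rbinom(n_rows * n_cols, size = 1, prob = 0.1),
            nrow = n_rows,
            dimnames = list(NULL, paste0("P", seq_len(n_cols))))

# Memory needed by dist() on n rows: n * (n - 1) / 2 doubles of 8 bytes each.
dist_gib <- function(n) n * (n - 1) / 2 * 8 / 2^30
full_gib <- dist_gib(nrow(m))

# Collapse identical rows: encode each 0/1 row as a string key,
# keep the first occurrence of every key, and count how often it occurs.
key <- apply(m, 1, paste0, collapse = "")
keep <- !duplicated(key)
u <- m[keep, , drop = FALSE]
u_counts <- as.vector(table(key)[key[keep]])
collapsed_gib <- dist_gib(nrow(u))

# Cluster only the distinct patterns; "binary" is the Jaccard-type distance in dist().
hc <- hclust(dist(u, method = "binary"))

cat("rows:", nrow(m), "-> distinct patterns:", nrow(u), "\n")
cat("dist() size in GiB:", full_gib, "->", collapsed_gib, "\n")
```

The collapsed matrix `u` could then be passed to pheatmap in place of the full data, with `u_counts` recording how many regions share each pattern. Alternatively, calling pheatmap with `cluster_rows = FALSE` skips the row clustering altogether.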

Related

Mclust freezes with small dataset

I am trying to use the Mclust() function from the R-package mclust on a dataset with 500 observations and 2 variables, and I want to identify 2 clusters.
> head(data)
x y
1 0.9929185 -1.9662945
2 8.2259360 -0.7240049
3 3.3866952 -1.8054764
4 -0.5161490 -2.3096992
5 1.8931073 -1.8928091
6 4.0833228 -1.9045669
> Mclust(data, G = 2)
fitting ...
|=============================================================== | 67%
This should produce an output relatively quickly, but freezes at 67%.
I ran this function multiple times over different datasets, and had no problems whatsoever. It even works if I only include observations up to row 498, but fails as soon as row 499+ is included.
498 -1.710175250 -1.612248596
499 -5.666497204 5.565422240
500 -3.649579976 1.552779499
I have uploaded the whole dataset in my GitHub repository: https://github.com/fstermann/bthesis/tree/main/MclustFreeze
I would greatly appreciate it if anyone has an idea why this is happening with this specific dataset.
> sessionInfo()
R version 4.0.5 (2021-03-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19042)
Matrix products: default
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] mclust_5.4.7
loaded via a namespace (and not attached):
[1] compiler_4.0.5 tools_4.0.5

Possible bug in as.POSIXct

I am working with time data, which is read in as strings, and I convert it to the POSIXct class. This works with all my data except for one specific string. What I do is, in essence:
Time1 <- '1900-04-01' # First Year then Month then Day
Time1_convert <- as.POSIXct( Time1, format='%Y-%m-%d')
I do this vectorized and all my data is converted correctly, except for the date 1920-05-01:
Time1 <- '1920-05-01'
Time1_convert <- as.POSIXct( Time1, format='%Y-%m-%d' )
This returns NA. I have no idea why this happens. If I add tz = 'GMT' to the as.POSIXct call, the time is converted correctly for all values. What I do not understand is why this happens, and why only with this specific value, when I have tried more than 1500 different time values.
I add an image of the output:
More code added:
for( m in c(01,02,03,04,05,06,07,08,09,10,11,12)){
print(as.POSIXct(paste0('1920-',m,'-01'),format='%Y-%m-%d'))
}
and the output is:
[1] "1920-01-01 CMT"
[1] "1920-02-01 CMT"
[1] "1920-03-01 CMT"
[1] "1920-04-01 CMT"
[1] NA
[1] "1920-06-01 -04"
[1] "1920-07-01 -04"
[1] "1920-08-01 -04"
[1] "1920-09-01 -04"
[1] "1920-10-01 -04"
[1] "1920-11-01 -04"
[1] "1920-12-01 -04"
Output of sessionInfo():
R version 3.3.3 (2017-03-06)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 9 (stretch)
locale:
[1] LC_CTYPE=es_AR.UTF-8 LC_NUMERIC=C
[3] LC_TIME=es_AR.UTF-8 LC_COLLATE=es_AR.UTF-8
[5] LC_MONETARY=es_AR.UTF-8 LC_MESSAGES=es_AR.UTF-8
[7] LC_PAPER=es_AR.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=es_AR.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods
[7] base
loaded via a namespace (and not attached):
[1] tools_3.3.3
Your locale settings appear to be based in Argentina. As it happens, Argentina reset its time zone on that date from UTC-4:16:48 to UTC-4. I think this means that there wasn't a midnight in Argentina on May 1, 1920. When you convert that string to POSIXct, it is interpreted as midnight that day in your local time zone, which by coincidence is a time that did not exist in Argentina. (This explains why it was not reproducible for others who tried the same code.)
http://www.statoids.com/tar.html
Locations in Argentina observed Local Mean Time until 1894-10-31 00:00
(as measured after the transition). At that moment, the entire country
synchronized on Córdoba's Local Mean Time, which was UTC-4:16:48. The
next transition occurred at 1920-05-01 00:00, when clocks were set
ahead sixteen minutes and forty-eight seconds to be an even UTC-4.
Argentina remained unified on UTC-4 until its first daylight saving
time was inaugurated in 1931.
If you need a POSIXct object, you might consider:
a) specifying a different time zone where midnight existed on that day.
as.POSIXct("1920-05-01", tz = "UTC")
# Or perhaps other nearby time zones didn't have that specific problem?
b) Storing the time in components, including one for date, and one for time within the day. e.g. time = hour(Time1) + minute(Time1)/60. It's a little unwieldy but it might be possible to perform the date / time calcs you need.
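As a rough sketch of option (a) applied to a whole vector, the conversion below pins the time zone to UTC so the result no longer depends on the machine's locale. The `as.Date()` line is an extra alternative not mentioned above, for the case where only the calendar date matters. The time zone name "America/Argentina/Buenos_Aires" is an assumption about the system's time zone database, and whether it yields NA for May depends on the R version and tzdata installed, so no output is claimed for it.

```r
# First day of each month in 1920, as strings.
dates <- sprintf("1920-%02d-01", 1:12)

# Option (a): fix the time zone explicitly so midnight always exists.
utc_times <- as.POSIXct(dates, format = "%Y-%m-%d", tz = "UTC")

# Alternative (an assumption, not from the answer): keep plain dates,
# which carry no time zone at all.
plain_dates <- as.Date(dates, format = "%Y-%m-%d")

# For comparison: the same strings in an Argentine zone (zone name assumed
# to be available on this system). Entries that come back NA, if any,
# are local midnights that did not exist.
local_times <- as.POSIXct(dates, format = "%Y-%m-%d",
                          tz = "America/Argentina/Buenos_Aires")
print(which(is.na(local_times)))
print(utc_times)
```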

Knitr and data.table

I have an automated report that I produce using knitr, and I'm running into the oddest problem. I wrote a function that sums the data by month for several locations. When I run this function in R I get the following result (which is correct):
###NAME MONTH VOL
###1: TOTAL 1 13.00872
###2: TOTAL 2 11.62527
###3: TOTAL 3 12.71313
###4: TOTAL 4 12.67269
###5: TOTAL 5 15.05127
###6: TOTAL 6 14.61002
###7: TOTAL 7 15.43827
###8: TOTAL 8 15.22400
###9: TOTAL 9 14.91259
###10: TOTAL 10 15.83505
###11: TOTAL 11 14.97242
###12: TOTAL 12 16.34950
When I run this same function (no changes) through knitr to produce the report, I get the following result:
###NAME MONTH VOL
###1: TOTAL 1 14.00872
###2: TOTAL 2 13.62527
###3: TOTAL 3 15.71313
###4: TOTAL 4 16.11338
###5: TOTAL 5 17.61269
###6: TOTAL 6 18.46945
###7: TOTAL 7 20.18851
###8: TOTAL 8 21.04382
###9: TOTAL 9 21.72287
###10: TOTAL 10 23.54272
###11: TOTAL 11 23.72971
###12: TOTAL 12 26.03293
I also have another table where knitr just prints nonsense even though the table has actual values in it.
Here is my session info:
R version 3.1.2 (2014-10-31) Platform: x86_64-w64-mingw32/x64 (64-bit)
locale: [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages: [1] stats graphics grDevices utils datasets methods base
other attached packages: [1] lubridate_1.3.3 xtable_1.7-4 shape_1.4.2 reshape2_1.4.1 rgdal_0.9-2 raster_2.2-12
[7] sp_1.0-17 png_0.1-7 data.table_1.9.2
loaded via a namespace (and not attached): [1] digest_0.6.8 evaluate_0.5.5 formatR_1.1 grid_3.1.2 knitr_1.9 lattice_0.20-29 memoise_0.2.1
[8] packrat_0.4.3 plyr_1.8.1 Rcpp_0.11.5 stringr_0.6.2 tools_3.1.2
UPDATE
I pinpointed the problem for at least one of the tables where this error occurs. The problem was the setnames function and the keyed merge feature of data.table.
When the merge happens, R marks duplicate column names using a ".1" notation (i.e., if table1 and table2 both have a column named CHEM, then TABLE = table1[table2] has columns named CHEM and CHEM.1), whereas under knitr they become CHEM and i.CHEM. To fix this, I originally used setnames(TABLE,names(TABLE),c(New column names)), but names(TABLE) did not come back in the order I expected, so I was renaming the wrong columns. This error only happened when the code was run through knitr; when I ran it in R alone it worked properly. What is the disconnect between knitr and data.table?
I will work on getting example code up, but as it stands the code would need to be simplified before posting an example would be helpful.
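A minimal sketch of one way to make the renaming independent of column order, assuming the data.table package is installed; the two tables and the new column names are made up for illustration. `setnames()` accepts explicit `old` and `new` vectors, so each column is renamed by its current name rather than by its position, and the name the join gives the duplicate column is looked up instead of hard-coded.

```r
library(data.table)

# Hypothetical tables that share the key ID and a non-key column CHEM.
table1 <- data.table(ID = 1:3, CHEM = c("a", "b", "c"), key = "ID")
table2 <- data.table(ID = 1:3, CHEM = c("x", "y", "z"), key = "ID")

# Keyed join; the duplicate column from table2 gets a modified name
# (for example "i.CHEM" or "CHEM.1", depending on the data.table version).
TABLE <- table1[table2]

# Find whatever name the join produced for the duplicate column.
dup_name <- setdiff(names(TABLE), c("ID", "CHEM"))
stopifnot(length(dup_name) == 1)

# Rename by name, not by position, so column order no longer matters.
setnames(TABLE, old = c("CHEM", dup_name), new = c("CHEM_T1", "CHEM_T2"))
print(names(TABLE))
```

Because the old names are matched explicitly, the same call should behave identically in an interactive session and under knitr, even if the two environments load different data.table versions.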

"Error in colnames" when merging xts sets

I am trying to make an irregular multivariate time series regular. I am doing this by merging the irregular time series (one measure every 7 days) with a regular "NA" filled time series (daily measures) as suggested by:
- Joshua Ulrich here.
- Dirk Eddelbuettel here.
When I try this method for multivariate time series, I get the error:
"Error in colnames<-(*tmp*, value = c("C.1", "C.2", "C.1.1", "C.2.1" : length of 'dimnames' [2] not equal to array extent"
My question is 2 fold:
How can I merge these two xts data sets without getting this error?
Is there a "better" way of making an irregular multivariate time series regular? I guess I was expecting to find a method in the xts package, but could not find one.
Code to Reproduce Error:
require(xts)
set.seed(42)
# make irregular index
irr_index <- seq(from=as.Date("2010-01-19"), length.out=10, by=7)
# make irregular xts
irr_xts <- xts( x= matrix( data= rnorm(20), ncol= 2,
dimnames= list(c(1:length(irr_index)),
c("C.1", "C.2"))),
order.by= irr_index)
# make regular index
reg_index <- seq(from=as.Date(start(irr_xts)), to=as.Date(end(irr_xts)), by=1)
empty <- xts(matrix(data = NA,
nrow = length(reg_index),
ncol = ncol(irr_xts)),
reg_index )
reg_xts <- na.fill(merge(irr_xts, empty), fill=0)
In practice my real data are sporadic, sometimes daily, sometimes skipping several days. My approach is to normalize all data to 1 observation per day with 0 for days with missing values.
Thanks in advance.
EDIT:
Here is my sessionInfo() as requested:
R version 3.0.2 (2013-09-25)
Platform: x86_64-w64-mingw32/x64 (64-bit)
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] xts_0.9-7 zoo_1.7-10
loaded via a namespace (and not attached):
[1] grid_3.0.2 lattice_0.20-24 tools_3.0.2
This works fine for me; I just followed Joshua Ulrich's link:
empty <- xts(,reg_index ) ## No need to set coredata to create empty xts
merge(irr_xts, empty, fill=0)
C.1 C.2
2010-01-19 1.370958 1.30487
2010-01-20 0.000000 0.00000
2010-01-21 0.000000 0.00000
2010-01-22 0.000000 0.00000
2010-01-23 0.000000 0.00000
2010-01-24 0.000000 0.00000
.....
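For completeness, the question's setup and the fix above can be combined into one script that also checks the result is regular. This is only a sketch assuming the xts package is installed; the object names follow the question, and the final checks are additions for illustration.

```r
library(xts)
set.seed(42)

# Irregular index: one observation every 7 days.
irr_index <- seq(from = as.Date("2010-01-19"), length.out = 10, by = 7)
irr_xts <- xts(matrix(rnorm(20), ncol = 2,
                      dimnames = list(NULL, c("C.1", "C.2"))),
               order.by = irr_index)

# Regular daily index spanning the same period.
reg_index <- seq(from = start(irr_xts), to = end(irr_xts), by = "day")

# Zero-width xts: carries only the index, so merge() adds no extra columns.
empty <- xts(, reg_index)

# Outer join on the index; days without an observation are filled with 0.
reg_xts <- merge(irr_xts, empty, fill = 0)

# Sanity checks: one row per day, original column names, no missing values.
print(nrow(reg_xts) == length(reg_index))
print(colnames(reg_xts))
print(any(is.na(coredata(reg_xts))))
```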

Loading stock information of Japan using quantmod package in R [closed]

Closed 10 years ago.
I have run into a problem using the R quantmod package. I can get the stock information for Korea, but I fail to get the information for Japan:
getSymbols("DEXKOUS",src="FRED") #load Korea
[1] "DEXKOUS"
getSymbols("DEXJPUS",src="FRED") #load Japan
Error as.POSIXlt.character(x, tz, ...) :
character string is not in a standard unambiguous format
Your comments are welcome.
sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: i386-pc-mingw32/i386 (32-bit)
locale:
[1] LC_COLLATE=Chinese (Simplified)_People's Republic of China.936
[2] LC_CTYPE=Chinese (Simplified)_People's Republic of China.936
[3] LC_MONETARY=Chinese (Simplified)_People's Republic of China.936
[4] LC_NUMERIC=C
[5] LC_TIME=Chinese (Simplified)_People's Republic of China.936
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] PerformanceAnalytics_1.0.3.2 quantmod_0.3-17 TTR_0.21-0
[4] xts_0.8-2 zoo_1.7-4 Defaults_1.1-1
[7] Rweibo_0.0-5 rjson_0.2.5 digest_0.5.1
[10] RCurl_1.6-6.1 bitops_1.0-4.1
loaded via a namespace (and not attached):
[1] grid_2.13.1 lattice_0.19-30 tools_2.13.1
The example you give in your question works fine for me too. I get a time series of the yen-dollar rate from 1971 onwards.
However, if you are looking for share price data rather than forex (you did say stock information?) then perhaps you should try the RFinanceYJ package, which extracts share price data from Yahoo! Japan.
require(RFinanceYJ)
sony <- quoteStockXtsData('6758.t', '2011-01-01')
tail(sony,30)
