I have two time series: a baseline (x) and one with an event (y). I'd like to cluster based on the dissimilarity of these two time series. Specifically, I'm hoping to create new features to predict the event. I'm much more familiar with clustering, but fairly new to time series.
I've tried a few different things with a limited understanding...
Simulating data...
x<-rnorm(100000,mean=1,sd=10)
y<-rnorm(100000,mean=1,sd=10)
This package seems awesome but there is limited information available on SO or Google.
library(TSclust)
d<-diss.ACF(x, y)
the value of d is
[,1]
[1,] 0.07173596
I then move on to clustering...
hc <- hclust(d)
but I get the following error:
Error in if (is.na(n) || n > 65536L) stop("size cannot be NA nor exceed 65536") :
missing value where TRUE/FALSE needed
My assumption is this error is because I only have one value in d.
Alternatively, I've tried the following on a single timeseries (the event).
library(dtw)
distMatrix <- dist(y, method="DTW")
hc <- hclust(y, method="complete")
but it takes FOREVER to run the distance Matrix.
I have a couple of guesses at what is going wrong, but could use some guidance.
My questions...
Do I need a set of baseline and a set of event time series? Or is one pairing ok to start?
My time series are quite large (100000 rows). I'm guessing this is causing the SLOW distMatrix calculation. Thoughts on this?
Any resources on applied clustering on large time series are welcome. I've done a pretty thorough search, but I'm sure there are things I haven't found.
Is this the code you would use to accomplish these goals?
Thanks!
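A minimal sketch of one way to get from several series to a dendrogram, using only base R rather than TSclust. hclust works on a "dist" object that covers a whole collection of items, so this assumes you have more than one baseline and more than one event series. The six simulated series, their names, and the choice of 20 lags are assumptions for illustration, not part of the original data. Each series is summarised by its estimated autocorrelations, and the Euclidean distance between those short vectors is used as the dissimilarity, which is similar in spirit to an ACF-based dissimilarity.

```r
set.seed(1)

# Hypothetical stand-ins for real data: three white-noise "baseline" series
# and three autocorrelated "event" series.
n_obs <- 2000
series <- list(
  base1  = rnorm(n_obs),
  base2  = rnorm(n_obs),
  base3  = rnorm(n_obs),
  event1 = as.numeric(arima.sim(list(ar = 0.8), n = n_obs)),
  event2 = as.numeric(arima.sim(list(ar = 0.8), n = n_obs)),
  event3 = as.numeric(arima.sim(list(ar = 0.8), n = n_obs))
)

# One row per series: autocorrelations at lags 1..20 (lag 0 is dropped)
acf_features <- t(sapply(series, function(s) {
  acf(s, lag.max = 20, plot = FALSE)$acf[-1]
}))

# Pairwise distances between the series, as a "dist" object
d <- dist(acf_features, method = "euclidean")

hc <- hclust(d, method = "complete")
groups <- cutree(hc, k = 2)
groups
```

Because every series is reduced to 20 numbers before any distance is computed, the cost grows with the number of series rather than with the 100000 points inside each one.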
I am using the package "survival" to fit a Cox model with time intervals (intervals are 30 days long). I am reading the data in from an xlsx worksheet. I keep getting an error saying that my stop time must be greater than my start time. The start values are all smaller than the stop values.
I checked to make sure these are being read in as numbers which they are. I also changed them to integers which did not solve the problem. I used this code to see if any observations met this criterion:
a <- a1[which(a1$end_time > a1$start_time),]
About half the dataset meets this criterion, but when I look at the data all the start times appear to be less than the end times.
Does anyone know why this is happening and how I can fix it? I am an R newbie so perhaps there is something obvious I don't know about?
model1<- survfit(Surv(start_time, end_time, censor) ~ exp, data=a1, weights = weight)
For this, I am using the banknote data in R given by data(banknote), which shows measurements of 200 Swiss banknotes. My data matrix is called X, and I have performed PCA by pca.banknote<-prcomp(X).
I am trying to show that the inner product between each observation X[i,] and Principal Component Loading 3 given by pca.banknote$rot[,3] is the same as the 3rd PC scores given by pca.banknote$x[,3].
I have attempted:
all.equal(as.matrix(X[,])%*%pca.banknote$rot[,3], as.matrix(pca.banknote$x[,3]), check.attributes=FALSE)
but this simply gives a mean difference of 1, i.e. they are not equal.
Do I need to change the format of one of these to a vector/data frame etc for this to work? Or any ideas at all as to where the issue is?
Any feedback would be much appreciated. Thanks.
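One thing worth checking, sketched here under assumptions: prcomp centers each column by default (center = TRUE), so the scores are inner products with the centered data, not with the raw rows. The example uses the built-in USArrests data as a stand-in, since the banknote matrix X is not reproduced here. The names X_centered and scores_manual are made up for the illustration.

```r
# USArrests is a built-in stand-in; substitute the banknote matrix X.
X <- as.matrix(USArrests)

pca <- prcomp(X)  # default center = TRUE, scale. = FALSE

# Subtract each column's mean, as prcomp does internally before rotating
X_centered <- sweep(X, 2, colMeans(X))

# Inner product of every centered observation with the 3rd loading vector
scores_manual <- X_centered %*% pca$rotation[, 3]

# Compare against the 3rd column of PC scores
all.equal(as.numeric(scores_manual), as.numeric(pca$x[, 3]))
```

If the comparison holds on the centered data but not on the raw data, the missing centering step is the likely cause of the mismatch.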
I am trying to figure out how to read a counts matrix into R and then cluster it using Euclidean distance and complete linkage. The original matrix has 56,000 rows (genes) and 7 columns (treatments). I want to see if there is a clustering relationship between the treatments. However, every time I try this, I get the error: Error: cannot allocate vector of size 544.4 Gb. Since I'm trying to reproduce work that has been published by someone else, I am wondering if I am making a mistake with my initial data entry.
Second, if I try such clustering with just 20 of the 56,000 genes, I am able to make a dendrogram, but the branches are not the experimental samples. The paper I am trying to replicate did such clustering, and its dendrogram shows the samples clustered.
Here is the code I am trying to run:
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(matrix(exprs),method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
And here is a sample of my data table:
AGS KATOIII MKN45 N87 SNU1 SNU5 SNU16
1_DDR1 11.18467721 11.91358171 11.81568242 11.08565284 8.054326631 12.46899188 10.54972491
2_RFC2 9.19869822 9.609015734 8.925772678 8.3641799 8.550993726 10.32160527 9.421779056
3_HSPA6 6.455324139 6.088320986 7.949175048 6.128573129 6.113793411 6.317460116 7.726657567
4_PAX8 8.511225092 8.719103196 8.706242048 8.705618546 8.696547633 9.292782564 8.710369119
5_GUCA1A 3.773404228 3.797729793 3.574286779 3.848753216 3.684193193 3.66065606 3.88239872
6_UBA7 6.477543321 6.631538303 6.506133756 6.433793116 6.145507918 6.92197071 6.479113995
7_THRA 6.263090367 6.507397854 6.896879084 6.696356125 6.243160864 6.936051147 6.444444498
8_PTPN21 6.88050894 6.342007735 6.55408163 6.099950167 5.836763044 5.904301086 6.097067306
9_CCL5 6.197989448 4.00619542 4.445053893 7.350765625 3.892650264 7.140038596 4.123639647
10_CYP2E1 4.379433632 4.867741561 4.719912827 4.547433566 6.530890968 4.187701905 4.453267508
11_EPHB3 6.655231606 7.984278173 7.025962652 7.111129175 6.246989328 6.169529157 6.546374446
12_ESRRA 8.675023046 9.270153715 8.948209029 9.412638347 9.4470612 9.98312055 9.534236722
13_CYP2A6 6.834018146 7.18386746 6.826740822 7.244411918 6.744588768 6.715122111 7.302922762
14_SCARB1 8.856802264 8.962211232 8.975200168 9.710291176 9.120002571 10.29588004 10.55749325
15_TTLL12 8.659539601 9.93935462 8.309244963 9.21145716 9.792647852 10.46958091 10.51879844
16_LINC00152 5.108632654 4.906321384 4.958158343 5.315532543 5.456138001 5.242577092 5.180295902
17_WFDC2 5.595843025 5.590991341 5.776102664 5.622086284 5.273603946 5.304240608 5.573746302
18_MAPK1 6.970036434 5.739881305 4.927993642 5.807358161 7.368137365 6.17697538 5.985006279
19_MAPK1 8.333269232 8.758733916 7.855324572 9.03596893 7.808283302 7.675434022 7.450262521
20_ADAM32 4.075355477 4.216259982 4.653654879 4.250333684 4.648194266 4.250333684 4.114286071
The rows describe genes (Ex., 1_DDR1, 2_RFC2, etc.) and the columns are experimental samples (Ex. AGS, KATOIII). I wish to see the relatedness of the samples in the cluster.
Here is my sample dendrogram that my code produces. I thought it would only show 7 branches reflecting my 7 samples:
The paper's dendrogram (including these 8 samples and many more as well) is below:
Thanks for any help you can provide!
You're running out of RAM. That's it. You can't allocate a vector that exceeds your available memory. Move to a computer with more memory, or maybe try the bigmemory package (I've never tried it).
https://support.bioconductor.org/p/53848/
In case anybody was wondering, the answer to my second question is below. I was calling matrix() on an object that was already a matrix, which collapsed it into a single column and scrambled the data. The following code works now!
exprs <- as.matrix(read.table("small_RMA_table.txt", header=TRUE, sep = "\t", row.names = 1, as.is=TRUE))
eucl_dist=dist(exprs,method = 'euclidean')
hie_clust=hclust(eucl_dist,method = 'complete')
plot(hie_clust)
Do you want to cluster on columns (detect similarities between treatments) or on rows (detect similarities between genes)? It sounds like you want the former, given that you're expecting 7 dendrogram branches for 7 treatments.
If so, then you need to transpose your dataset. dist computes a distance matrix for rows, not columns, which is not what you want.
Once you've done the transpose, your clustering should take no time at all, and minimal memory.
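A small sketch of the transpose approach, using the first three genes from the sample table in the question so it runs on its own. With the real file you would keep the read.table call and only change the dist line.

```r
# First three genes of the sample table; columns are the 7 samples
exprs <- matrix(
  c(11.18467721, 11.91358171, 11.81568242, 11.08565284, 8.054326631, 12.46899188, 10.54972491,
    9.19869822,  9.609015734, 8.925772678, 8.3641799,   8.550993726, 10.32160527, 9.421779056,
    6.455324139, 6.088320986, 7.949175048, 6.128573129, 6.113793411, 6.317460116, 7.726657567),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    c("1_DDR1", "2_RFC2", "3_HSPA6"),
    c("AGS", "KATOIII", "MKN45", "N87", "SNU1", "SNU5", "SNU16")
  )
)

# t() puts the samples in rows, so dist() compares samples, not genes
eucl_dist <- dist(t(exprs), method = "euclidean")
hie_clust <- hclust(eucl_dist, method = "complete")

hie_clust$labels  # the leaves of the dendrogram are the 7 sample names
# plot(hie_clust) # uncomment to draw the dendrogram
```

The distance object now holds 7 * 6 / 2 = 21 pairwise distances regardless of how many genes there are, which is why the memory problem goes away.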
I have five variables (channels), each with lower and upper bounds, and I invest some amount of money in each channel. Is there an optimizer, or some logic, that finds the global maximum of the functional form given below? The sum of the amounts across channels must not exceed my total spend.
parameters=c(10,120,105,121,180,140) #intercept and variable coefficients
spend=c(16,120,180,170,180) # total spend
total=sum(spend)
upper_bound=c(50,200,250,220,250)
lower_bound=c(10,70,100,90,70)
var1=seq(lower_bound[1],upper_bound[1],by=1)
var2=seq(lower_bound[2],upper_bound[2],by=1)
var3=seq(lower_bound[3],upper_bound[3],by=1)
var4=seq(lower_bound[4],upper_bound[4],by=1)
var5=seq(lower_bound[5],upper_bound[5],by=1)
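The functional form is: exp(β_0 - β_i / x_i), where β_0 is the intercept, β_i is the coefficient of channel i, and x_i is the spend on channel i.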
I have used expand.grid function to find existing combinations. But I am getting too many combinations.
Here is my code.
seq_data=expand.grid(var1=var1,var2=var2,var3=var3,var4=var4,var5=var5)
rs=rowSums(seq_data)
seq_data=seq_data[rs<=total,]
seq_data1=seq_data
for(i in 1:length(seq_data))
seq_data1[,i]=exp(parameters[1]-parameters[i+1]/seq_data1[,i])
How can I overcome this problem? Please suggest any alternatives.
Thanks in advance.
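One alternative to enumerating the grid, sketched here under assumptions, is a constrained numerical optimizer such as constrOptim from the stats package. The sketch assumes the objective is the sum of exp(β_0 - β_i / x_i) over the five channels, treats spend as continuous rather than in steps of 1, and starts from an arbitrary point just inside the bounds. constrOptim is a local optimizer, so this does not guarantee a global maximum; trying several starting points would be a reasonable check. The names neg_response, ui, ci, start, and fit are made up for the example.

```r
parameters  <- c(10, 120, 105, 121, 180, 140)  # intercept and coefficients
spend       <- c(16, 120, 180, 170, 180)
total       <- sum(spend)
upper_bound <- c(50, 200, 250, 220, 250)
lower_bound <- c(10, 70, 100, 90, 70)

# constrOptim minimises, so return the negative of the total response
neg_response <- function(x) {
  -sum(exp(parameters[1] - parameters[-1] / x))
}

# Linear constraints in the form ui %*% x - ci >= 0:
#   x_i >= lower_i, x_i <= upper_i, sum(x) <= total
ui <- rbind(diag(5), -diag(5), rep(-1, 5))
ci <- c(lower_bound, -upper_bound, -total)

# Starting point must lie strictly inside the feasible region
start <- lower_bound + 1

fit <- constrOptim(theta = start, f = neg_response, grad = NULL,
                   ui = ui, ci = ci)

fit$par       # spend per channel at the optimum found
sum(fit$par)  # total spend used
-fit$value    # response at that point
```

With grad = NULL, constrOptim falls back to Nelder-Mead, which needs no derivatives. If integer spends are required, rounding the result and re-checking the budget is one simple option.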
I would like to interpolate/extrapolate values (concentrations) along a stream network. In theory the best match so far is the rtop package in R, but I seem to be hitting a bug and cannot run its example data. Does anyone have another ready-made suggestion using any kind of OS program?
I also tried to solve the problem myself in R, but I came across several problems.
My data frame (I also have shapefiles, the stream network, and catchment areas):
StartID | EndID | Discharge | Length | Value
First, I would like inverse distance weighted (IDW) interpolation: find the segments where I have observations and interpolate the NA values between them, weighting each observation by its distance.
Second, I would also like to take discharge into account. When two streams join, the stream with the higher discharge should have more influence on the concentration in the next segment.
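To make the two ideas concrete, here is a minimal sketch of the arithmetic only, not a network-aware implementation. The function names idw_estimate and confluence_value, the power parameter, and the example numbers are all hypothetical. It assumes the distances are along-stream distances to the observed segments (for example, summed from the Length column).

```r
# Inverse distance weighting: closer observations get larger weights
idw_estimate <- function(values, distances, power = 2) {
  w <- 1 / distances^power
  sum(w * values) / sum(w)
}

# Two observations, 1 and 3 length units away from the unknown segment
idw_estimate(values = c(10, 20), distances = c(1, 3), power = 1)
# 12.5

# At a confluence: mix the tributary values in proportion to discharge
confluence_value <- function(values, discharges) {
  weighted.mean(values, w = discharges)
}

# Tributary with discharge 3 dominates the one with discharge 1
confluence_value(values = c(2, 4), discharges = c(3, 1))
# 2.5
```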
I am able to find the NA values, check whether there are observations directly upstream or downstream of the segment, and take the discharge-weighted mean:
for (i in seq_len(nrow(DF))) {
  if (is.na(DF[i, "Value"])) {
    # neighbours whose EndID is this segment's StartID
    a <- merge(DF[i, ], DF, by.x = "StartID", by.y = "EndID")
    a <- a[!is.na(a$Value.y), ]
    # neighbours whose StartID is this segment's EndID
    b <- merge(DF[i, ], DF, by.x = "EndID", by.y = "StartID")
    b <- b[!is.na(b$Value.y), ]
    up <- sum(a$Discharge.y * a$Value.y) / sum(a$Discharge.y)
    down <- sum(b$Discharge.y * b$Value.y) / sum(b$Discharge.y)
    # mean() takes one vector; NaN from a side with no observations is dropped
    DF[i, "Value"] <- mean(c(up, down), na.rm = TRUE)
  }
}
But I think it would be better to look for the observations close to each other and interpolate for the NA values. But I really got stuck. I do not hope for ready-to-use-scripts, but I would be glad if I could get some feedback and directions.
Thanks a lot, Celia