We want to use the dtw library for R in order to shrink and expand certain time series data to a standard length.
Consider three time series with equivalent columns: moref has 105 rows, mobig has 130, and mosmall has 100. We want to project mobig and mosmall to a length of 105.
moref <- good_list[[2]]
mobig <- good_list[[1]]
mosmall <- good_list[[3]]
Therefore, we compute two alignments.
ali1 <- dtw(mobig, moref)
ali2 <- dtw(mosmall, moref)
If we print out the alignments the result is:
DTW alignment object
Alignment size (query x reference): 130 x 105
Call: dtw(x = mobig, y = moref)
DTW alignment object
Alignment size (query x reference): 100 x 105
Call: dtw(x = mosmall, y = moref)
So far, this looks like exactly what we want. From my understanding, we need to use the warping functions ali1$index1 or ali1$index2 in order to shrink or expand the time series. However, if we invoke the following commands
length(ali1$index1)
length(ali2$index1)
length(ali1$index2)
length(ali2$index2)
the result is
[1] 198
[1] 162
[1] 198
[1] 162
These are vectors of indices (probably referring to other vectors). Which one of these can we use for the mapping? Aren't they all too long?
First of all, we need to agree that index1 and index2 are two vectors of the same length that map query/input data to reference/stored data and vice versa.
Since you did not provide any data, here is some dummy data to give people an idea.
# Reference data is the template that we use as reference.
# say perfect pronunciation from CNN
data_reference <- 1:10
# Query data is the input data that we want to map to our reference
# say random youtube audio
data_query <- seq(1,10,0.5) + rnorm(19)
library(dtw)
alignment <- dtw(x=data_query, y=data_reference, keep=TRUE)
alignment$index1
alignment$index2
lcm <- alignment$costMatrix
image(x=1:nrow(lcm), y=1:ncol(lcm), lcm)
plot(alignment, type="threeway")
Here are the outputs:
> alignment$index1
[1] 1 2 3 4 5 6 7 7 8 9 10 11 12 13 13 14 14 15 16 17 18 19
> alignment$index2
[1] 1 1 1 2 2 3 3 4 5 6 6 6 6 6 7 8 9 9 9 9 10 10
So basically, the mapping from index1 to index2 is how the input data maps to the reference data, i.e. the 10th data point in the input data has been matched to the 6th data point in the template.
index1: Warping function φx(k) for the query
index2: Warping function φy(k) for the reference
-- Toni Giorgino
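To see these matched pairs side by side (a small illustrative snippet):
# each row is one step of the warping path: (query index, reference index)
pairs <- cbind(query = alignment$index1, reference = alignment$index2)
pairs[pairs[, "query"] == 10, ]   # query point 10 is matched to reference point 6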
Per your question, "what is the deal with the length of the index": since it is basically the coordinates of the optimal warping path, it can be as long as m+n-1 (a very shallow path) or as short as max(m,n) (a near-diagonal path). Clearly, it is not a one-to-one mapping, which might bother people a little bit; you can do more research from here on how to pick the mapping you want.
I don't know if there is some built-in functionality to pick the best one-to-one mapping, but here is one way.
library(plyr)
# one row per step of the warping path
mapping <- data.frame(index1=alignment$index1, index2=alignment$index2)
# keep a single reference index per query index (here: the largest one)
mapping <- ddply(mapping, .(index1), summarize, index2_new = max(index2))
Now mapping contains a one-to-one mapping from query to reference. Then you can map the query to the reference and scale the mapped input in whatever way you want.
I am not completely sure about this last mapping-and-scaling step, and anyone is more than welcome to improve how the mapping and scaling should work.
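Coming back to the original question, here is a rough sketch of how the alignment could be used to bring mosmall (100 rows) onto the 105-row grid of moref. It assumes mosmall is a matrix or data frame and uses dtw's warp() helper as well as the hand-built mapping idea from above, so please check ?warp before relying on it.
# Option 1: warp() returns, for each reference index, the query index to use,
# so subscripting by it stretches the query onto the reference time axis.
idx <- warp(ali2, index.reference = FALSE)
mosmall_projected <- mosmall[idx, ]    # 105 rows, aligned to moref
# Option 2: build the index by hand from the warping path,
# keeping one query index per reference index (here the last one).
idx2 <- tapply(ali2$index1, ali2$index2, max)
mosmall_projected2 <- mosmall[idx2, ]  # also 105 rows
The same recipe with ali1 projects mobig (130 rows) down to 105 rows.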
While stats::cutree() takes an hclust object and cuts it into a given number of clusters, I'm looking for a function that takes a given number of elements and attempts to set k accordingly. In other words: return the first cluster with n elements.
For example:
Searching for the first cluster with n = 9 objects.
library(psych)
data(bfi)
x <- bfi
hclust.res <- hclust(dist(abs(cor(na.omit(x)))))
cutree.res <- cutree(hclust.res, k = 2)
cutree.table <- table(cutree.res)
cutree.table
# no cluster with n = 9 elements
cutree.res
 1  2 
23  5 
while k = 3 yields
cutree.res <- cutree(hclust.res, k = 3)
# three clusters; cluster 2 contains the required number of objects
> cutree.table
cutree.res
1 2 3
14 9 5
Is there a more convenient way than iterating over this?
Thanks
You can easily write code for this yourself that only does one pass over the dendrogram rather than calling cutree in a loop.
Just execute the merges one by one and note the cluster sizes. Then keep the one that you "liked" the best.
Note that there might be no such solution. For example, on the one-dimensional data set -11 -10 +10 +11, cutting the dendrogram in merge order will return clusters with 1, 2, or 4 elements only. So you'll have to handle this case, too.
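For illustration, here is a minimal sketch of that single pass (the function name first_cluster_of_size is made up for this example): walk hclust$merge once, track the size of every newly formed cluster, and cut at the first merge that produces a cluster of the requested size.
first_cluster_of_size <- function(hc, n_target) {
  n_obs <- nrow(hc$merge) + 1         # number of observations
  sizes <- integer(nrow(hc$merge))    # size of the cluster created at each merge
  for (i in seq_len(nrow(hc$merge))) {
    left  <- hc$merge[i, 1]           # negative = singleton, positive = earlier merge
    right <- hc$merge[i, 2]
    sizes[i] <- (if (left  < 0) 1 else sizes[left]) +
                (if (right < 0) 1 else sizes[right])
    if (sizes[i] == n_target) {
      # cutting right after merge i leaves n_obs - i clusters,
      # one of which has exactly n_target elements
      return(cutree(hc, k = n_obs - i))
    }
  }
  NULL                                # no merge ever creates a cluster of that size (the case noted above)
}
# e.g. first_cluster_of_size(hclust.res, 9)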
I have some data in a 3D grid identified by simple i,j,k locations (no real-world spatial information). These data are in a RasterStack right now.
b <- stack(system.file("external/rlogo.grd", package="raster"))
# add more layers
b <- stack(b,b)
# dimensions
dim(b)
[1] 77 101 6
yields 77 rows, 101 columns, 6 layers.
# upscale by 2
up <- aggregate(b,fact=2)
dim(up)
[1] 39 51 6
yields 39 rows, 51 columns, 6 layers.
Hoped-for behavior: 3 layers.
I'm looking for a method to aggregate across layers in addition to the present behavior, which is to aggregate within each layer. I'm open to other data structures, but would prefer an existing upscaling/resampling/aggregation algorithm to one I write myself.
Potentially related are http://quantitative-advice.gg.mq.edu.au/t/fast-way-to-grid-and-sum-coordinates/110/5 or the spacetime package, which assumes the layers are temporal rather than spatial, adding more complexity.
Suppose you define an agg.fact variable with the value 2:
agg.fact <- 2
up <- aggregate(b, fact = agg.fact)
dim(up)
[1] 39 51 6
Now we generate a matrix indicating which layers will be aggregated together, based on agg.fact:
positions <- matrix(1:nlayers(b), nrow = nlayers(b)/agg.fact, byrow = TRUE)
[,1] [,2]
[1,] 1 2
[2,] 3 4
[3,] 5 6
And apply a function (in this case mean, but it could be max, sum, or another) to each pair of layers:
up2 <- stack(apply(positions, 1, function(x){
mean(b[[x[1]]], b[[x[2]]])
}))
dim(up2)
[1] 77 101 3
Or, if you want to aggregate in all three dimensions (choose whether to aggregate in 1D-2D first and then 3D, or vice versa):
up3 <- stack(apply(positions, 1, function(x){
  aggregate(mean(b[[x[1]]], b[[x[2]]]), fact = agg.fact)  # first across layers (3d), then within layers
  # mean(aggregate(b[[x[1]]], fact = agg.fact), aggregate(b[[x[2]]], fact = agg.fact))  # first within layers (1d-2d), then across
}))
dim(up3)
[1] 39 51 3
I did not read the documentation correctly. To aggregate across layers:
For example, fact=2 will result in a new Raster* object with 2*2=4 times fewer cells. If two numbers are supplied, e.g., fact=c(2,3), the first will be used for aggregating in the horizontal direction, and the second for aggregating in the vertical direction, and the returned object will have 2*3=6 times fewer cells. Likewise, fact=c(2,3,4) aggregates cells in groups of 2 (rows) by 3 (columns) and 4 (layers).
It may be necessary to play with expand=TRUE vs expand=FALSE to get it to work, but this seems inconsistent (I have reported it as a bug).
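So, to reproduce the hoped-for 39 x 51 x 3 result, something along these lines should work (a sketch based on the documentation quoted above):
# aggregate by 2 in rows, 2 in columns, and 2 across layers
up <- aggregate(b, fact = c(2, 2, 2))
dim(up)
# expected: 39 51 3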
I have data saved in a text file with a couple of thousand lines. Each line has only one value, like this:
52312
2
3
4
5
7
9
4
5
3
The first value is always roughly 10,000 times bigger than all the other values.
I can read in the data with data<-read.table("data.txt")
When I just use plot(data), all the data points get the same y-value, resulting in a line where the x-values simply represent the values from the data.
What I want, however, is that the x-value represents the linenumber and y-value the actual data. So for the above example my values would be (1,52312), (2,2), (3,3), (4,4), (5,5), (6,7), (7,9), (8,4), (9,5), (10,3).
Also, since the first value is way higher than all the other values, I'd like to use a log scale for the y-axis.
Sorry, very new to R.
set.seed(1000)
df = data.frame(a=c(9999999,sample(2:78,77,replace = F)))
plot(x=1:nrow(df), y=log(df$a))
i) set.seed(1000) helps you reproduce the same random numbers from sample() each time you run this code. It makes code reproducible.
ii) type ?sample in R console for documentation.
iii) since you wanted the x-axis to be the line number, I create it using the ":" operator: 1:3 = 1, 2, 3. Similarly, 1:nrow(df) creates an "id" index based on the number of rows in your data.
iv) for the log scale, just wrap the values in log() :). Read more about ?plot and its parameters.
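v) applied to your original file, a minimal sketch (assuming your read.table call, which puts the values in column V1) could look like this:
data <- read.table("data.txt")        # one value per line, column V1
plot(seq_len(nrow(data)), data$V1,    # x = line number, y = value
     log = "y",                       # log scale on the y axis
     xlab = "line number", ylab = "value")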
Try this:
df
x y
1 1 52312
2 2 2
3 3 3
4 4 4
5 5 5
6 6 7
7 7 9
8 8 4
9 9 5
10 10 3
library(ggplot2)
ggplot(df, aes(x, y)) + geom_point(size=2) + scale_y_log10()
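If you start from the file in the question, a df like the one above could be built along these lines (a sketch, assuming read.table gives a single column named V1):
data <- read.table("data.txt")
df <- data.frame(x = seq_len(nrow(data)), y = data$V1)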
I have a dataframe of xyz coordinates of units in 5 different boxes, all 4x4x8 so 128 total possible locations. The units are all different lengths. So even though I know the coordinates of the unit (3 units in, 2 left, and 1 up) I don't know the exact location of the unit in the box (12' in, 14' left, 30' up?). The z dimension corresponds to length and is the dimension I am interested in.
My instinct is to run a for loop summing values, but that is generally not the most efficient in R. The key elements of the for loop would be something along the lines of:
master$unitstartpoint[i] <- if (master$unitz[i] == 1) 0
master$unitstartpoint[i] <- if (master$unitz[i] > 1) master$unitstartpoint[i-1] + master$length[i-1]
i.e. the unit start point is 0 if it is the first in the z dimension, otherwise it is the start point of the prior unit + the length of the prior unit. Here's the data:
# generate dataframe
master<-c(rep(1,128),rep(2,128),rep(3,128),rep(4,128),rep(5,128))
master<-as.data.frame(master)
# input basic data--what load number the unit was in, where it was located
# relative other units
master$boxNumber<-master$master
master$unitx<-rep(c(rep(1,32),rep(2,32),rep(3,32),rep(4,32)),5)
master$unity<-c(rep(1,8),rep(2,8),rep(3,8),rep(4,8))
master$unitz<-rep(1:8,80)
# create unique unit ID # based on load number and xyz coords.
master <- transform(master, ID = paste0(boxNumber, unitx, unity, unitz))
# generate how long the unit is. this length will be used to identify unit
# location in the box
master$length<-round(rnorm(640,13,2))
I'm guessing there is a relatively easy way to do this with apply or by but I am unfamiliar with those functions.
Extra info: the unit ID's are unique and the master dataframe is sorted by boxNumber, unitx, unity, and then unitz, respectively.
This is what I am shooting for:
length unitx unity unitz unitstartpoint
15 1 1 1 0
14 1 1 2 15
11 1 1 3 29
13 1 1 4 40
Any guidance would be appreciated. Thanks!
It sounds like you just want a cumulative sum along the z dimension for each box/x/y combination. I used a cumulative sum because otherwise, if you reset to 0 when z=1, your definition would leave off the length at z=8. We can do this easily with ave:
# running (cumulative) length within each box/x/y combination, in z order
ml <- with(master, ave(length, boxNumber, unitx, unity, FUN=cumsum))
I'm not exactly sure which values you want returned, but this column roughly translates to how you were redefining length above. If I combine it with the original data and look at the total length for the first box for x=1, y=1:4
# head(subset(cbind(master, ml), unitz==8),4)
master boxNumber unitx unity unitz length ID ml
8 1 1 1 1 8 17 1118 111
16 1 1 1 2 8 14 1128 104
24 1 1 1 3 8 10 1138 98
32 1 1 1 4 8 10 1148 99
we see the total lengths for those positions. Since we are using cumsum, we are assuming that the z values are sorted, as you have indicated they are. If you just want one total overall length per box/x/y combo, you can replace cumsum with sum.
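If the start points themselves are what you are after (as in the desired output above), one sketch, assuming the sorting you describe, is to subtract each unit's own length from the running total:
# exclusive cumulative sum: total length of the units *before* each one along z
master$unitstartpoint <- with(master,
  ave(length, boxNumber, unitx, unity, FUN = cumsum) - length)
head(master[, c("length", "unitx", "unity", "unitz", "unitstartpoint")], 4)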
I have what I think is a tough problem, so I look forward to hearing some options. Here is my working example:
cellID X Y Area AvgGFP DeviationGFP AvgRFP DeviationsRFP Slice GUI.ID
1 1 18.20775 26.309859 568 5.389085 7.803248 12.13028 5.569880 0 1
2 2 39.78755 9.505495 546 5.260073 6.638375 17.44505 17.220153 0 1
3 3 30.50000 28.250000 4 6.000000 4.000000 8.50000 1.914854 0 1
4 4 38.20233 132.338521 257 3.206226 5.124264 14.04669 4.318130 0 1
5 5 43.22467 35.092511 454 6.744493 9.028574 11.49119 5.186897 0 1
6 6 57.06534 130.355114 352 3.781250 5.713022 20.96591 14.303546 0 1
7 7 86.81765 15.123529 1020 6.043137 8.022179 16.36471 19.194279 0 1
8 8 75.81932 132.146417 321 3.666667 5.852172 99.47040 55.234726 0 1
9 9 110.54277 36.339233 678 4.159292 6.689660 12.65782 4.264624 0 1
10 10 127.83480 11.384886 569 4.637961 6.992881 11.39192 4.287963 0 1
This is a text file with information regarding an image; I have many others with more rows. Columns X and Y correspond to the X/Y pixel coordinates on the image. By entering this command, I get a nice representation of the data in a plot:
p <- ggplot(total_stats[[slice]], aes(X, Y))
p + geom_point(aes(colour = AvgGFP)) + scale_colour_gradient(low = 'white', high = 'black')
What I want to do is the following.
1) ID cells with an AvgGFP value above a certain threshold, let's say 75. I want to take the identified cells' AvgGFP values and put them in a data.frame called hiAvgGFP.
2) ID any cells that are within a certain distance of the hi AvgGFP cells, making sure to exclude the hi AvgGFP cell used as the center. Let's set the radius to 50. I want to take these cells' AvgGFP values and put them in a data.frame called surrounding_cells.
3) Next I want to perform this process on all data.frames - there are 40 called slice1-slice40, which are all contained in 'total_stats'
I am imagining the end result to look like this -
2 new data.frames: hi cells (hiAvgGFP) and surrounding cells (surrounding_cells).
Each of these data.frames will have 40 columns containing AvgGFP values from slices 1-40. Since the slices do not all have an equal number of rows, fill the empty cells in the data set with NA.
MAN! that was tough typing out! As always any and all help is very much appreciated.
You are quite vague on some critical details regarding your data. If your data is gridded, then I would recommend coercing it into a raster class object and then using a focal function to perform your conditional cell calculation.
If this data is in fact not gridded, then you could use the functionality of the spdep package to calculate nearest neighbours, using knearneigh (k nearest) or dnearneigh (distance-based). You can easily coerce your current data into an sp SpatialPointsDataFrame object to conduct this type of analysis.
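As a rough sketch of the spdep route for a single slice (assuming the column names and the total_stats list from the question, and the radius of 50 stated there):
library(spdep)
slice_df <- total_stats[[1]]                     # one slice, as in the question
hi_idx   <- which(slice_df$AvgGFP > 75)          # cells above the threshold
coords   <- as.matrix(slice_df[, c("X", "Y")])
nb       <- dnearneigh(coords, d1 = 0, d2 = 50)  # neighbours within 50 pixels
# AvgGFP of the hi cells, and of the cells surrounding each of them;
# dnearneigh() never lists a point as its own neighbour, so the centre cell is
# excluded automatically (other hi cells within the radius could be filtered out if needed)
hiAvgGFP    <- slice_df$AvgGFP[hi_idx]
surrounding <- lapply(hi_idx, function(i) slice_df$AvgGFP[nb[[i]]])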
Another alternative, if you have access to the original rasters, is to apply a focal function through the raster extract function, using the above point locations, to accomplish your goal.
If you have a spatial problem then it is prudent to leverage the spatial classes in R.