I have a time series and it contains 256 integer values. It looks like this:
I have calculated the STFT (Short-Time Fourier Transform) with this code in R:
s<-stft(datalist, win=min(80,floor(length(datalist)/10)), inc=min(24,floor(length(datalist)/30)), coef=256, wtype="hanning.window")
As a result I have a matrix with 29 rows and 256 values per row. If I plot one row of this matrix (e.g. the 10th row), I see a diagram like this:
But I expected the coefficient diagram to look like the first diagram (only in another dimension).
Should I use another package in R to do this job? Or is my understanding wrong?
I guess you are using the stft function from the package GENEAread. In your case, the call is basically
s<-stft(datalist, win=25, inc=8, coef=256, wtype="hanning.window")
So the way I read it, you are taking 25 samples but computing 256 coefficients from this. The documentation states that the maximal (reasonable) value for coef is win/2, due to the Nyquist-Shannon sampling theorem I guess. So all but the first 12 or so coefficients will be mostly bogus. And those first few coefficients are off the scale of your plot, so we can't say anything about these either.
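The numbers in the simplified call can be checked with a few lines of arithmetic. This is only a sketch: the window count assumes windows are placed every inc samples with no padding, which is my assumption rather than something taken from the package documentation, although it does agree with the 29 rows reported in the question.

```r
n <- 256                              # length of datalist, as stated in the question

win <- min(80, floor(n / 10))         # window length actually used
inc <- min(24, floor(n / 30))         # step between successive windows
max_coef <- floor(win / 2)            # largest reasonable coef for this window

# Assumption: one window every `inc` samples, no padding at the ends
n_windows <- floor((n - win) / inc) + 1

c(win = win, inc = inc, max_coef = max_coef, n_windows = n_windows)
```

With these values, asking for coef=256 requests far more coefficients than a 25-sample window can support.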
I don't know where your expectation came from, and I don't share it. But I also believe there are some more fundamental problems with how you expect this to work.
I apologize ahead of time for the crude way this question is worded. I was under the impression for the longest time that what I'm trying to do is called "Normalizing data" but after googling to try and find the method to do this, I seem to be mistaken so I'm not sure exactly what it's called that I'm trying to do (bear with me please).
I have a set of data like this:
0.17407
0.05013
0.08520
0.02892
0.02986
0.06286
0.04453
0.00425
0.20470
0.02267
0.01470
0.02460
0.01735
0.01069
0.02168
0.13912
0.02004
0.02018
0.07837
When you add them all you get 1.05392.
I'd like to "adjust" the data set so that the relative values all remain the same but the sum is equal to 1. When I googled normalizing data sets, I found a formula like this:
(x-min(x))/(max(x)-min(x))
However, this simply "ranks" each data point as a certain percentage of the maximum value so that your max value in your data set is equal to 1 and the minimum, 0.
Extra: Could someone enlighten me what this is called if not normalizing data. Obviously I've been carrying around this ignorant belief for far too long.
If you want your data to sum to 1 you normalize your data. You normalize by dividing by the sum of your series (sum_i x_i, where x_i are the elements of your data series).
The formula you mention is another possible rescaling, but as you observed it has a different effect. Note that in the first case you map x -> c*x (in your case: x -> 1/1.05392*x), while the second case rescales with x -> c*x + offset. Note also that the latter is not linear (unless min(x) = 0), that is f(x+y) != f(x) + f(y).
If your whole confusion is about the naming of things, then I would not worry too much. After all there is only convention and common agreement, but no absolute truth/authority. And the terms are reused in different fields, cf. Normalization on Wikipedia:
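A short sketch of both rescalings on the data from the question may make the difference concrete. The variable names x_sum1 and x_minmax are just illustrative.

```r
x <- c(0.17407, 0.05013, 0.08520, 0.02892, 0.02986, 0.06286, 0.04453,
       0.00425, 0.20470, 0.02267, 0.01470, 0.02460, 0.01735, 0.01069,
       0.02168, 0.13912, 0.02004, 0.02018, 0.07837)

# Divide by the sum: ratios between elements are preserved, total becomes 1
x_sum1 <- x / sum(x)

# Min-max rescaling: smallest value maps to 0, largest to 1
x_minmax <- (x - min(x)) / (max(x) - min(x))

sum(x_sum1)      # 1, up to floating point error
range(x_minmax)  # 0 and 1
```

Only the first version keeps the relative sizes of the values intact, which is what the question asks for.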
Normalization or normalisation refers to a process that makes something more normal or regular
I've been using the psych package to compare two correlation matrices using the function cortest.
Now I want to try the cortest.mat and cortest.jennrich functions, which require an object of the classes psych and sim. I have tried converting my correlation matrices with sim.structure, which results in an object of those classes, but I get an error when running either function.
Here is what I've tried using Random numbers:
Random<-cor(matrix(rnorm(400, 0, .25), nrow=(20), ncol=(20)))
SimRandom<-sim.structure(Random)
class(SimRandom)
cortest.jennrich(SimRandom,SimRandom,n1=400, n2=400)
Yields the following:
Error in if (dim(R1)[1] != p) { : argument is of length zero
I'm sure I'm doing it wrong, both because of the error message and because the values in Random and SimRandom are not exactly the same.
What is the correct way to convert a correlation matrix to an object of the classes psych and sim, to use as input for cortest.mat?
Thanks in advance.
EDIT: Short explanation on what I want to do. Using Random numbers serves just as an example. The actual correlation matrices to compare are done as follows. I have a huge list of files each composed of 100 observations for a specific genetic location. These files can be grouped into say 20 files based on known genetic relationships, thus I use those groups of files, load them into a matrix as columns and calculate cor(). That gives a correlation matrix. As a control I load random files and treat them the same way. This matrix contains real data, but the grouping is done randomly. In the end I have two correlation matrices 1-That contains the correlations of pre-selected files and 2- that contains the correlations between randomly loaded files. Both matrices are the same size.
What I would like to do is to compare the two correlation matrices to have an idea whether the grouping has an influence on the correlation values observed.
Sorry for not explaining this earlier, I wanted to avoid the long explanation and keep the question simple.
I want to analyse angles in movement of animals. I have tracking data that has 10 recordings per second. The data per recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording and furthermore includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle, however since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem, but I do not know how to do either of them in R, so help would be greatly appreciated.
The first: Reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample for example every 3rd or 10th recording of my data set?
The second: By converting straight movement into so called 'flights'; rule based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be arbitrarily set. Does anyone have any idea how to do that with the xy coordinate positional data that I have?
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. Worst case scenario, all the data points are in a single flight segment, meaning you would check the first point the same number of times as you have data points, the second point one less, etc. That means n + (n-1) + (n-2) + ... + 1 = n(n+1)/2 operations. That's O(n^2); the operating time could have quadratic growth with respect to the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find the unit vector v orthogonal to the vector from A to C, then look at whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing, here is one example.
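A minimal 2D sketch of that check in R might look like the following. The function names perp_dist and within_column are made up for illustration, and points are assumed to be numeric vectors of the form c(x, y).

```r
# Perpendicular distance of point b from the line through a and c_end (2D).
perp_dist <- function(a, b, c_end) {
  ac <- c_end - a
  ab <- b - a
  len <- sqrt(sum(ac^2))
  # Degenerate case: a and c_end coincide, fall back to plain distance from a
  if (len == 0) return(sqrt(sum(ab^2)))
  v <- c(-ac[2], ac[1]) / len   # unit vector orthogonal to the a -> c_end direction
  abs(sum(ab * v))              # magnitude of the scalar projection of ab onto v
}

# TRUE if b lies within distance x of the line from a to c_end
within_column <- function(a, b, c_end, x) perp_dist(a, b, c_end) <= x

perp_dist(c(0, 0), c(1, 1), c(2, 0))           # distance of (1,1) from the x-axis
within_column(c(0, 0), c(1, 1), c(2, 0), 0.5)  # FALSE, the point is too far off
```

Note that this treats x as the allowed distance on either side of the line, matching the "perpendicular distance larger than x" rule in the question.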
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then another function that takes a set of points and calculates the vector v and calls the first function for each point in the set. Then run some data and see how long it takes.
I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, with something that works, it might also be easier to acquire help from people with more knowledge and experience to optimize it.
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about growth of data I'm talking about the average data sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll need to potentially analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of flight segment is the largest subset of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but if you looked at the column of width x around the vector to the second to last point, one point wouldn't meet the criteria anymore. Depending on how you define the flight segment then (e.g. if you want it to be the largest possible set of points that meet your condition and don't care about what happens inside), you may need something different (e.g. work backwards instead of forwards).
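To make the forward-working definition concrete, here is a rough sketch of a greedy segmentation in R. It is only one possible reading of the rule: a flight is extended point by point, and it ends as soon as any intermediate point lies further than tol from the straight line between the flight's first point and the newest point. The name split_flights and the argument tol are hypothetical, and the worst case is quadratic, as discussed above.

```r
# Assign a flight id to each point of a track given as x/y coordinate vectors.
split_flights <- function(x, y, tol) {
  n <- length(x)
  flight <- integer(n)
  if (n == 0) return(flight)
  flight[1] <- 1L
  id <- 1L
  start <- 1L
  i <- 2L
  while (i <= n) {
    ok <- TRUE
    if (i - start >= 2L) {
      mid <- (start + 1L):(i - 1L)          # points between flight start and point i
      dx <- x[i] - x[start]
      dy <- y[i] - y[start]
      len <- sqrt(dx^2 + dy^2)
      d <- if (len == 0) {
        sqrt((x[mid] - x[start])^2 + (y[mid] - y[start])^2)
      } else {
        # perpendicular distance of each intermediate point from the start -> i line
        abs((x[mid] - x[start]) * dy - (y[mid] - y[start]) * dx) / len
      }
      ok <- all(d <= tol)
    }
    if (ok) {
      flight[i] <- id
      i <- i + 1L
    } else {
      # Point i breaks the current flight: it ends at i - 1,
      # and a new flight starts from that same point.
      id <- id + 1L
      start <- i - 1L
    }
  }
  flight
}

# An L-shaped track: right along the x-axis, then straight up
split_flights(x = c(0, 1, 2, 3, 3, 3, 3), y = c(0, 0, 0, 0, 1, 2, 3), tol = 0.1)
```

Each point is labelled with the flight it ends, so the corner point belongs to the first flight and serves as the start of the second. Working backwards or searching for the longest valid segment would need a different loop.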
I am using randomForest package in R platform to build a binary classifier. There are about 30,000 rows with 14,000 being in positive class and 16,000 in negative class. I have 15 variables that have been known to be important for classification.
I have some additional variables (about 5) which have missing information. These variables have values 1 or 0. 1 means presence of something, but 0 means that it is not known whether it is present or absent. It is widely known that these variables would be the most important variables for classification (they increase the reliability of the classification, and it's more likely that the sample lies in the positive class) if there is a 1, but useless if there is a 0. And only 5% of the rows have value 1. So, one variable is useful for only 5% of the cases. The 5 variables are independent of each other, so I expect that these will be highly useful for 15-25% of the data I have.
Is there a way to make use of available data but neglect the missing/unknown data present in a single column? Your ideas and suggestions would be appreciated. The implementation does not have to be specific to random forest and R-platform. If this is possible using other machine learning techniques or in other platforms, they are also most welcome.
Thank you for your time.
Regards
I can see at least the following approaches. Personally, I prefer the third option.
1) Discard the extra columns
You can choose to discard those 5 extra columns. Obviously this is not optimal, but it is good to know the performance of this option, to compare with the following.
2) Use the data as it is
In this case, those 5 extra columns are left as they are. The definite presence (1) or unknown presence/absence (0) in each of those 5 columns is used as information. This is the same as saying "if I'm not sure whether something is present or absent, I'll treat it as absent". I know this is obvious, but if you haven't tried this, you should, to compare it to option 1.
3) Use separate classifiers
If around 95% of each of those 5 columns has zeroes, and they are roughly independent of each other, that's 0.95^5 = 77.38% of data (roughly 23200 rows) which has zeroes in ALL of those columns. You can train a classifier on those 23200 rows, removing the 5 columns which are all zeroes (since those columns are equal for all points, they have zero predictive utility anyway). You can then train a separate classifier for the remaining points, which will have at least one of those columns set to 1. For these points, you leave the data as it is.
Then, for your test point, if all those columns are zeroes you use the first classifier, otherwise you use the second.
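A rough sketch of option 3 in R is shown below. To keep it runnable without extra packages it uses logistic regression (glm from base R) as a stand-in classifier; the same routing logic should carry over to randomForest. The data are simulated, and the column names x1, x2, f1, f2 and the helper names fit_split and predict_split are all hypothetical.

```r
set.seed(42)
n <- 5000
# Simulated stand-in data: two "normal" predictors and two sparse flags
# (1 = known present, 0 = unknown), each set in roughly 5% of rows
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n),
                  f1 = rbinom(n, 1, 0.05), f2 = rbinom(n, 1, 0.05))
dat$y <- rbinom(n, 1, plogis(dat$x1 + 2 * dat$f1 + 2 * dat$f2))
flag_cols <- c("f1", "f2")

# Train one model on rows where all flags are 0 (flags dropped),
# and another on rows where at least one flag is 1 (flags kept).
fit_split <- function(d) {
  any_flag <- rowSums(d[flag_cols]) > 0
  list(
    no_flag = glm(y ~ x1 + x2, family = binomial, data = d[!any_flag, ]),
    flag    = glm(y ~ x1 + x2 + f1 + f2, family = binomial, data = d[any_flag, ])
  )
}

# Route each new row to the model that matches its flag pattern.
predict_split <- function(models, newdata) {
  any_flag <- rowSums(newdata[flag_cols]) > 0
  p <- numeric(nrow(newdata))
  if (any(!any_flag)) {
    p[!any_flag] <- predict(models$no_flag, newdata[!any_flag, , drop = FALSE],
                            type = "response")
  }
  if (any(any_flag)) {
    p[any_flag] <- predict(models$flag, newdata[any_flag, , drop = FALSE],
                           type = "response")
  }
  p
}

models <- fit_split(dat)
p <- predict_split(models, dat)   # predicted probability of the positive class
```

In practice the predictions should of course be evaluated on held-out data, and compared against options 1 and 2.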
Other tips
If the 15 "normal" variables are not binary, make sure you use a classifier which can handle variables with different normalizations. If you're not sure, normalize the 15 "normal" variables to lie in the interval [0,1] -- you probably won't lose anything by doing this.
I'd like to add a further suggestion to Herr Kapput's: if you use a probabilistic approach, you can treat "missing" as a value which you have a certain probability of observing, either globally or within each class (not sure which makes more sense). If it's missing, it has probability of occurring p(missing), and if it's present it has probability p(not missing) * p(val | not missing). This allows you to gracefully handle the case where the values have arbitrary range when they are present.
I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.
From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?
While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on underlying population distribution) as follows.
You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).
Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.
To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K and look it up in that file structure with the mentioned "highest that's <=" and use the corresponding value.
Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, do the logic and control in Python and only the statistics in R itself, but, that's a personal choice!
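For data that fits in memory, the same cumulative-count lookup can be sketched directly in R with cumsum() and findInterval(). This is an in-memory illustration rather than the on-disk btree described above. The function name draw_values is made up, and I am assuming that each key is the cumulative count before its value and that X is drawn from 0 to K - 1.

```r
# Draw n values so that each value is picked in proportion to its count.
draw_values <- function(values, counts, n) {
  keys <- cumsum(counts) - counts      # cumulative count before each value
  K <- sum(counts)
  X <- sample.int(K, n, replace = TRUE) - 1   # random integers in 0 .. K - 1
  # findInterval returns the index of the highest key that is <= X
  values[findInterval(X, keys)]
}

values <- c(1, 2, 5, 10)
counts <- c(600, 250, 100, 50)
new_values <- draw_values(values, counts, 1000)
table(new_values) / 1000   # should be roughly 0.60, 0.25, 0.10, 0.05
```

This reproduces the empirical distribution only; it can never generate a value that was not in the original file.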
To see whether you have a real power law distribution, make a log-log plot of frequencies and see whether they line up roughly on a straight line. If you do have a straight line, you might want to read this article on the Pareto distribution for more on how to describe your data.
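As a rough sketch of that check, the slope of the log-log relationship can be estimated with an ordinary least-squares fit. The helper name loglog_slope and the example data are made up, and a straight-line fit on a log-log scale is only an informal diagnostic, not a rigorous test for a power law.

```r
# Slope of log10(counts) against log10(values), ignoring non-positive entries.
loglog_slope <- function(values, counts) {
  keep <- values > 0 & counts > 0
  fit <- lm(log10(counts[keep]) ~ log10(values[keep]))
  unname(coef(fit)[2])
}

# Synthetic example: counts fall off as values^(-2)
values <- 1:50
counts <- 1e6 * values^(-2)

# plot(values, counts, log = "xy")   # visual check: points should lie on a line
loglog_slope(values, counts)         # about -2 by construction
```

If the points curve away from a straight line at either end, a pure power law is probably not the right description.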
I'm assuming that you're interested in understanding the distribution over your categorical values.
The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.
To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:
affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)
In practice, you will probably bring in your 100m rows of values and counts using R's read.csv() function. Assuming you've got a header line labeled "values\t counts", that code might look something like this:
dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough freed up for 100m rows of data (storing character strings as factors will help reduce the footprint).