Determining distribution so I can generate test data - r

I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.
From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?

While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on the underlying population distribution) as follows.
You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).
Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.
To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K, look it up in that file structure with the mentioned "highest key <= X" search, and use the corresponding value.
Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, do the logic and control in Python and only the statistics in R itself, but, that's a personal choice!
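For data that fits in memory, a rough in-R sketch of the same cumulative-count lookup could look like the following (illustration only; the data frame dat and its column names values and counts are assumptions about your file):
## In-memory sketch of the cumulative-count lookup described above.
## Assumes a data frame `dat` with columns `values` and `counts` (names assumed).
dat <- dat[order(dat$values), ]
cum <- cumsum(dat$counts)                  # key = cumulative count up to each value
K <- cum[length(cum)]
X <- sample.int(K, 1000, replace = TRUE)   # 1000 uniform draws between 1 and K
draws <- dat$values[findInterval(X, cum, left.open = TRUE) + 1]
For the full 100M rows you would fall back to an on-disk index as described above.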

To see whether you have a real power law distribution, make a log-log plot of frequencies and see whether they line up roughly on a straight line. If you do have a straight line, you might want to read this article on the Pareto distribution for more on how to describe your data.
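In R, the check can be sketched in one line (assuming a data frame dat with columns named values and counts; both names are assumptions):
## Log-log plot of value vs. frequency; a power law appears roughly linear here.
plot(counts ~ values, data = dat, log = "xy")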

I'm assuming that you're interested in understanding the distribution over your categorical values.
The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.
To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:
affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)
In practice, you will probably read in your 100M rows of values and counts using R's read.csv() function. Assuming you've got a tab-separated file with a header line naming the columns "values" and "counts", that code might look something like this:
dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
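As a quick, rough sanity check (a sketch only; with just 100 draws the match will be loose), you can compare the sampled proportions with the proportions implied by the original counts:
## Compare empirical proportions of the new sample with the original count proportions.
round(prop.table(table(new.dat)), 3)
round(prop.table(xtabs(counts ~ values, data = dat)), 3)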
One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough memory free for 100M rows of data (storing character strings as factors will help reduce the footprint).

Related

R - select cases so that the mean of a variable is some given number

I previously worked on a project where we examined some sociological data. I did the descriptive statistics and after several months, I was asked to make some graphs from the stats.
I made the graphs, but something seemed odd, and when I compared the graphs to the numbers in the report, I noticed that they were different. Upon investigating further, I found that my cleaning code (which removed participants with duplicate IDs) now results in more rows, i.e. more participants with unique IDs than previously. I now have 730 participants, whereas previously there were 702. I don't know if this is due to updates of some packages, and unfortunately I cannot post the actual data here because it is confidential, but I am trying to find out who these 28 participants are and what happened in the data.
Therefore, I would like to know if there is a method that allows the user to filter the cases so that the mean of some variables is a set number. Ideally it would be something like this, but of course I know that it's not going to work in this form:
iris %>%
filter_if(mean(.$Petal.Length) == 1.3)
I know this is an incorrect attempt, but I don't know any other way to try this, so I am looking for help and suggestions.
I'm not convinced this is a tractable problem, but you may get somewhere by doing the following.
Firstly, work out what the sum of the variable was in your original analysis, and what it is now:
old_sum <- 702 * old_mean
new_sum <- 730 * new_mean
Now work out what the sum of the variable in the extra 28 cases would be:
extra_sum <- new_sum - old_sum
This allows you to work out the relative proportions of the sum of the variable from the old cases and from the extra cases. Put these proportions in a vector:
contributions <- c(extra_sum/new_sum, old_sum/new_sum)
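To make the arithmetic concrete, here is a purely hypothetical worked example (the means below are invented for illustration only, not taken from your data):
## Hypothetical numbers, purely for illustration.
old_mean <- 42.0
new_mean <- 41.5
old_sum <- 702 * old_mean            # 29484
new_sum <- 730 * new_mean            # 30295
extra_sum <- new_sum - old_sum       # 811
contributions <- c(extra_sum/new_sum, old_sum/new_sum)   # about 0.027 and 0.973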
Now, using the functions described in my answer to this question, you can find the optimal solution to partitioning your variable to match these two proportions. The rows which end up in the "extra" partition are likely to be the new ones. Even if they aren't the new ones, you will be left with a sample that has a mean that differs from your original by less than one part in a million.

Having difficulty using R programming to implement a trading strategy using multiple securities

I am currently attempting to implement a trading idea that I have been playing around with. It consists of 50+ securities and has a strategy very similar to this one. (Current package I am using is quantmod).
http://www.r-bloggers.com/backtesting-a-simple-stock-trading-strategy/
For those who aren't interested in clicking, it is a strategy that looks at the past X days (in his case 200) and enters a position depending on the peak reached in the stock. I understand how to apply this strategy to my idea, but I cannot grasp how to aggregate my data into one summary.
Is there a way I can consolidate the summary for all the positions I have entered into one larger portfolio summary and chart that against the S&P 500?
Any advice on where I can find resources, or pointers to the relevant information, would be appreciated. I have looked at the portfolio analysis packages for R and I do not believe they will be much help to me.
Thank you in advance.
Edit: In the link, at the bottom, there are 3 indices: FTSE, N225, and DJIA. Could I combine those 3 summaries to show the same output as below, but combined?
FTSE:
Me Index
Cumulative Return 3.56248582 3.8404476
Annual Return 0.05667121 0.0589431
Annualized Sharpe Ratio 0.45907768 0.3298633
Win % 0.53216374 0.5239884
Annualized Volatility 0.12344579 0.1786895
Maximum Drawdown -0.39653398 -0.5256991
Max Length Drawdown 1633.00000 2960.0000
Could I get that same output but for the 3 securities combined? Is there an effective way of doing that? Thank you so much. Happy holidays!
It's a little unclear to me what you mean by "combine" in this case. If you want a single column representing the combined returns from all three exchanges as if they were a single unified market, that's really tricky, because the exchanges trade in different currencies (British pounds, U.S. dollars, and Japanese yen). The underlying analysis would have to be modified substantially to take into account fluctuating daily foreign exchange rates.
I suspect that this is NOT what you want. Rather, you are simply asking how to take three sequential two-column outputs and turn them into a single parallel six-column output.
If that is indeed what you want, then you need to rewrite the testStrategy() function shown near the bottom of the link. As it's currently written, that function takes three inputs: an index name myStock (with allowed values of FTSE, DJIA, or N225), and two integer values, nHold and nHigh. You would need to change it so that it instead accepts five inputs; e.g., myStockA, myStockB and myStockC, plus the two integer values already mentioned. Then each of the lines currently referring to myStock would have to be replicated three times. Finally, the two cbind() lines that you see at the bottom would have to be modified so that instead of merging the data together into only two columns, you include all six.
For a good intro tutorial on how to write and modify your own R functions, please see this. To understand how to use the cbind() function, which you will have to call with six rather than two inputs, please see this.
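Whether you modify testStrategy() to take three symbols or simply call the existing version once per index, the final combining step might end up looking roughly like this (a sketch only, not the code from the link; the return value of testStrategy(), the nHold/nHigh values, and the column names are all assumptions):
## Rough sketch: assume testStrategy() returns a two-column object of
## (strategy, index) returns for one market; argument values are placeholders.
res.ftse <- testStrategy("FTSE", nHold = 100, nHigh = 200)
res.djia <- testStrategy("DJIA", nHold = 100, nHigh = 200)
res.n225 <- testStrategy("N225", nHold = 100, nHigh = 200)
combined <- cbind(res.ftse, res.djia, res.n225)   # six columns side by side
colnames(combined) <- c("FTSE.Me", "FTSE.Index", "DJIA.Me", "DJIA.Index", "N225.Me", "N225.Index")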

calculating DFT coefficient

I have a time series and it contains 256 integer values. It looks like this:
I have calculated the STFT (Short-Time Fourier Transform) with this code in R:
s<-stft(datalist, win=min(80,floor(length(datalist)/10)), inc=min(24,floor(length(datalist)/30)), coef=256, wtype="hanning.window")
As a result I have a matrix with 29 rows and 256 columns. If I plot one row of this matrix (e.g. the 10th row), I see a diagram like this:
But my expectation was that the coefficient diagram should look like the first diagram (only in another dimension).
Should I use another package in R to do this job, or is my understanding wrong?
I guess you are using the stft function from the GENEAread package. In your case, the call is basically
s<-stft(datalist, win=25, inc=8, coef=256, wtype="hanning.window")
So the way I read it, you are taking 25 samples but computing 256 coefficients from this. The documentation states that the maximal (reasonable) value for coef is win/2, due to the Nyquist-Shannon sampling theorem I guess. So all but the first 12 or so coefficients will be mostly bogus. And those first few coefficients are off the scale of your plot, so we can't say anything about these either.
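For example (a sketch, not a definitive fix), you could either lower coef to at most win/2, or force a larger window so that more coefficients are meaningful; the values below are just for illustration:
## Larger window (80 samples) with coef kept at win/2; values are illustrative only.
s <- stft(datalist, win = 80, inc = 24, coef = 40, wtype = "hanning.window")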
I don't know where your expectation came from, and I don't share it. But I also believe there are some more fundamental problems with how you expect this to work.

Multiple events in TraMineR

I'm trying to analyse multiple sequences with TraMineR at once. I've had a look at seqdef but I'm struggling to understand how I'd create a TraMineR dataset when I'm dealing with multiple variables. I guess I'm working with something similar to the dataset used by Aassve et al. (as mentioned in the tutorial), whereby each wave has information about several states (e.g. children, marriage, employment). All my variables are binary. Here's an example of a dataset with three waves (D,W2,W3) and three variables.
D<-data.frame(ID=c(1:4),A1=c(1,1,1,0),B1=c(0,1,0,1),C1=c(0,0,0,1))
W2<-data.frame(A2=c(0,1,1,0),B2=c(1,1,0,1),C2=c(0,1,0,1))
W3<-data.frame(A3=c(0,1,1,0),B3=c(1,1,0,1),C3=c(0,1,0,1))
L<-data.frame(D,W2,W3)
I may be wrong but the material I found deals with the data management and analysis of one variable at a time only (e.g. employment status across several waves). My dataset is much larger than the above so I can't really impute these manually as shown on page 48 of the tutorial. Has anyone dealt with this type of data using TraMineR (or similar package)?
1) How would you feed the data above to TraMineR?
2) How would you compute the substitution costs and then cluster them?
Many thanks
When using sequence analysis, we are interested in the evolution of one variable (for instance, a sequence of one variable across several waves). You then have several possibilities for analyzing multiple variables:
1) Create one sequence object per variable and then analyze the links between the clusters of sequences. In my opinion, this is the best way to go if your variables measure different concepts (for instance, family and employment).
2) Create a new variable for each wave that is the interaction of the different variables of that wave, using the interaction function. For instance, for wave one, use L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop=TRUE) (use drop=TRUE to remove unused combinations of answers), and then analyze the sequence of this newly created variable. In my opinion, this is the preferred way if your variables are different dimensions of the same concept; for instance, marriage, children and union are all related to family life.
3) Create one sequence object per variable and then use seqdistmc to compute the distances (multi-channel sequence analysis). This is equivalent to the previous method, depending on how you set the substitution costs (see below).
If you use the second strategy, you could set the substitution costs by counting the differences between the original variables. For instance, between the states "Married, Child" and "Not married, Child", you could set the substitution cost to 1 because there is a difference only on the "marriage" variable. Similarly, you would set the substitution cost between the states "Married, Child" and "Not married, No Child" to 2 because the states differ on both variables. Finally, set the indel cost to half the maximum substitution cost. This is the strategy used by seqdistmc.
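For the second strategy, a rough sketch using the example data L from your question might look like this (the IntVar* names are made up, and "TRATE" substitution costs are shown only for brevity; you could instead hand-build the cost matrix from the number of differing variables as described above):
## Sketch only: one interaction variable per wave, then a single sequence object.
library(TraMineR)
L$IntVar1 <- interaction(L$A1, L$B1, L$C1, drop = TRUE)
L$IntVar2 <- interaction(L$A2, L$B2, L$C2, drop = TRUE)
L$IntVar3 <- interaction(L$A3, L$B3, L$C3, drop = TRUE)
int.seq <- seqdef(L, var = c("IntVar1", "IntVar2", "IntVar3"))
sm <- seqsubm(int.seq, method = "TRATE")   # or a hand-built matrix of differences
dist.om <- seqdist(int.seq, method = "OM", sm = sm, indel = max(sm) / 2)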
Hope this helps.
Biemann and Datta (2013) talk about multidimensional sequence analysis, which means creating multiple sequences for the same "individuals".
I used the following approach to do so:
1) define 3 dimensional sequences
comp.seq <- seqdef(comp,NULL,states=comp.scodes,labels=comp.labels, alphabet=comp.alphabet,missing="Z")
titles.seq <- seqdef(titles,NULL,states=titles.scodes,labels=titles.labels, alphabet=titles.alphabet,missing="Z")
member.seq <- seqdef(member,NULL,states=member.scodes,labels=member.labels, alphabet=member.alphabet,missing="Z")
2) Compute the multi channel (multi dimension) distance
mcdist <- seqdistmc(channels=list(comp.seq,member.seq,titles.seq),method="OM",sm=list("TRATE","TRATE","TRATE"),with.missing=TRUE)
3) Cluster it with Ward's method:
library(cluster)
clusterward<- agnes(mcdist,diss=TRUE,method="ward")
plot(clusterward,which.plots=2)
Never mind parameters like "missing", "left", etc.; I hope the brief code sample helps.

Movement data analysis in R; Flights and temporal subsampling

I want to analyse angles in movement of animals. I have tracking data that has 10 recordings per second. The data per recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording and furthermore includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle; however, since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem, but I do not know how to achieve either of them in R, so help would be greatly appreciated.
The first: Reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample for example every 3rd or 10th recording of my data set?
The second: By converting straight movement into so called 'flights'; rule based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be arbitrarily set. Does anyone have any idea how to do that with the xy coordinate positional data that I have?
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. Worst case scenario, all the data points are in a single flight segment, meaning each point gets checked against every later point: the first point n-1 times, the second n-2 times, and so on. That means (n-1) + (n-2) + ... + 1 = n(n-1)/2 operations. That's O(n^2); the running time could grow quadratically with the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of width x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find a vector v orthogonal to the vector from A to C, then look at whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing; here is one example.
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then write another function that takes a set of points, calculates the vector v, and calls the first function for each point in the set. Then run some data through it and see how long it takes.
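A rough, untested sketch of those two functions might look like this (the function and argument names are my own, and I'm treating x as the maximum allowed perpendicular distance from the flight direction):
## Is point B within perpendicular distance x.max of the line from A to C?
## A, B, C are length-2 numeric vectors of (x, y) coordinates.
in_column <- function(A, B, C, x.max) {
  AC <- C - A
  AB <- B - A
  ## 2D cross-product magnitude divided by |AC| gives B's perpendicular distance.
  perp <- abs(AC[1] * AB[2] - AC[2] * AB[1]) / sqrt(sum(AC^2))
  perp <= x.max
}
## Do all intermediate points of `pts` (an n x 2 matrix, one point per row)
## fall within the column defined by its first and last points?
all_in_column <- function(pts, x.max) {
  n <- nrow(pts)
  if (n <= 2) return(TRUE)
  all(vapply(2:(n - 1), function(i) in_column(pts[1, ], pts[i, ], pts[n, ], x.max),
             logical(1)))
}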
I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, once you have something that works, it might also be easier to get help optimizing it from people with more knowledge and experience.
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about growth of data I'm talking about the average data sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll need to potentially analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of flight segment is the largest subset of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but if you looked at the column of width x around the vector to the second to last point, one point wouldn't meet the criteria anymore. Depending on how you define the flight segment then (e.g. if you want it to be the largest possible set of points that meet your condition and don't care about what happens inside), you may need something different (e.g. work backwards instead of forwards).
