I am using gnuplot and the function fitting facilities to perform least squares fitting to some of my data.
I have many data points (sometimes tens of millions) and hence fitting to all data points is impossible. (Or at least too slow to be practical.)
It is possible to plot data points with the keyword every (EDIT: Should be pointinterval not every!) followed by an integer, N, to plot only every Nth point.
e.g. plot 'data.csv' using 1:2 pointinterval 1000 plots every thousandth data point. This is useful when plotting tens of millions of points - you can't see anything useful otherwise.
Is there a similar way of doing this with fitting, i.e. fit only every 1000th point?
I tried fit 'data.csv' f(x) using 1:2 pointinterval 1000 via a,b where a and b are parameters of my f(x) - but I just get an error: ';' expected.
I also tried googling this and reading the documentation for gnuplot plotting but didn't find anything.
Alternatively, I could change my program code to write only every 1000th point to the data file, but then I would have to keep two sets of data files - one with all the points and one with 1 in every 1000 data points... which seems rather wasteful.
Edit: I am not sure why I thought every was the correct syntax for this. It turns out it should be pointinterval (pi for short) followed by an integer.
However, this only works for plotting, not function fitting, so the question is still open.
Note for future reference: use the every syntax.
I am working with code that describes a Poisson cluster process in spatstat, breaking down each line of code one at a time to understand it. It is easy to begin with.
library(spatstat)
lambda<-100
win<-owin(c(0,1),c(0,1))
n.seeds<-lambda*win$xrange[2]*win$yrange[2]
Once the window is defined, I then generate my points using a random generation function:
x=runif(min=win$xrange[1],max=win$xrange[2],n=pmax(1,n.seeds))
y=runif(min=win$yrange[1],max=win$yrange[2],n=pmax(1,n.seeds))
I know this can be plotted straight away using the ppp function:
seeds<-ppp(x=x,
y=y,
window=win)
plot(seeds)
In the next line I add marks to the ppp object; apparently they describe the angle of rotation of the points. I don't understand how this works right now, but that is okay; I will figure it out later.
marks<-data.frame(angles=runif(n=pmax(1,n.seeds),min=0,max=2*pi))
seeds1<-ppp(x=x,
y=y,
window=win,
marks=marks)
The first problem I encounter is that an object called pops, describing the populations of the window, is added to the ppp object. I understand how the values are derived: it is a Poisson distribution given the input value mu (which can be any value), with the total number of observations equal to the number of points in the window.
seeds2<-ppp(x=x,
y=y,
window=win,
marks=marks,
pops=rpois(lambda=5,n=pmax(1,n.seeds)))
My first question is, how is it possible to add a variable that has no classification in the ppp object? I checked the ppp documentation and there is no mention of pops.
The second question I have is about using the $ operator twice; the next line requires an sapply function to define dimensions.
dim1<-pmax(1,sapply(seeds1$marks$pops, FUN=function(x)rpois(n=1,sqrt(x))))
I have never seen the $ operator being used twice, and seeds2$marks$pop returns the error "$ operator is invalid for atomic vectors". Could you explain what is going on here?
Many thanks.
That's several questions - please ask one question at a time.
From your post it is not clear whether you are trying to understand someone else's code, or developing code yourself. This makes a difference to the answer.
Just to clarify, this code does not come from inside the spatstat package; it is someone's code using the spatstat package to generate data. There is code in the spatstat package to generate simulated realisations of a Poisson cluster process (which is I think what you want to do), and you could look at the spatstat code for rPoissonCluster to see how it can be done correctly and efficiently.
The code you have shown here has numerous errors. But I will start by answering the two questions in your title.
The rules for creating ppp objects are set out in the help file for ppp. The help says that if the argument window is given, then unmatched arguments ... are ignored. This means that in the line seeds2<-ppp(x=x,y=y,window=win,marks=marks,pops=rpois(lambda=5,n=pmax(1,n.seeds)))
the argument pops will be ignored.
The idiom sapply(seeds1$marks$pops, FUN=f) is perfectly valid syntax in R. If the object seeds1 is a structure or list which has a component named marks, which in turn is a structure or list which has a component named pops, then the idiom seeds1$marks$pops would extract it. This has nothing particularly to do with sapply.
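To give a toy illustration (my own example, not taken from the code above): if an object has a list or data-frame component marks, which in turn has a column pops, the nested $ extracts that column, and sapply then applies a function to each element of it:
obj <- list(marks = data.frame(angles = c(0.1, 0.7), pops = c(3, 5)))
obj$marks$pops                                          # extracts the pops column: 3 5
sapply(obj$marks$pops, function(x) rpois(n = 1, lambda = sqrt(x)))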
Now, turning to the errors in the code:
The line n.seeds<-lambda*win$xrange[2]*win$yrange[2] is presumably meant to calculate the expected number of cluster parents (cluster seeds) in the window. This would only work if the window is a rectangle with bottom left corner at the origin (0,0). It would be safer to write n.seeds <- lambda * area(win).
However, the variable n.seeds is used later as if it were the number of cluster parents (cluster seeds). The author has forgotten that the number of seeds is random, with a Poisson distribution. So the more correct calculation would be n.seeds <- rpois(1, lambda * area(win))
However this is still not correct, because cluster parents (seed points) outside the window can also generate offspring points inside the window. So seed points must actually be generated in a larger window obtained by expanding win. The appropriate commands used inside spatstat to generate the cluster parents are bigwin <- grow.rectangle(Frame(win), cluster_diameter) ; Parents <- rpoispp(lambda, win=bigwin)
The author apparently wants to assign two mark values to each parent point: a random angle and a random number pops. The correct way to do this is to make the marks a data frame with two columns, for example marks(seeds1) <- data.frame(angles=runif(n.seeds, max=2*pi), pops=rpois(n.seeds, 5))
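Pulling those corrections together, a minimal sketch of the corrected parent-generation step might look like this (cluster_diameter is a placeholder expansion distance you would have to choose for your process, and the variable names are mine):
library(spatstat)
lambda <- 100
win <- owin(c(0, 1), c(0, 1))
cluster_diameter <- 0.1                                 # placeholder value
bigwin <- grow.rectangle(Frame(win), cluster_diameter)  # expand so parents just outside win are included
Parents <- rpoispp(lambda, win = bigwin)                # Poisson number of parents, not a fixed count
n.parents <- npoints(Parents)
marks(Parents) <- data.frame(angles = runif(n.parents, max = 2*pi),
                             pops   = rpois(n.parents, 5))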
I am trying to (eventually) plot data by groups, using the prodlim function.
I'm adjusting and adapting code that someone else (not available for questions) has written, and I'm not very familiar with the prodlim library/function. There are definitely other ways to do what I'd like to, but I'm trying to keep it consistent with what the previous person did.
I have code that works, when dividing the data into 2 groups, but when I try to adjust for a 4 group situation, I get an error.
Of note, the data is coming over from SAS using StatTransfer, which has been working fine.
I am new to coding, but I have compared the dataframes I'm trying to work with. The second is just a subset of the first (where the code does work), with all the same variables, and both of the variables I'm trying to group by are integer values.
Hist(medpop$dz_time, medpop$dz_status) works just fine, so the problem must be with the prodlim function, and I haven't understood much of what I've looked up about it, sadly :/ But the documentation seems to indicate it supports continuous or categorical variables, and doesn't seem limited to binary either. None of the options seem applicable as I understand them.
this works:
M <- prodlim(Hist(dz_time, dz_status)~med, data=pop)
where med is a binary value =1 when a member of this population is taking it, and dz is a disease that some portion develop.
this does not:
(either of these gets the error below)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=medpop)
N <- prodlim(Hist(dz_time, dz_status)~strength, data=pop, subset=pop$med==1)
medpop = the subset of the original population taking the med,
strength = categorical variable ("1","2","3","4")
For the line that does work, the next step is just plot(M), giving a plot with two lines, med==0 and med==1 (showing cumulative incidence of dz_status by dz_time).
For the other line, I get an error saying
Error in KernSmooth::dpik(cumtabx/N, kernel = "box") :
scale estimate is zero for input data
I don't know what that means or how to fix it.. :/
Thank you kindly for your time.
I'm merely trying to plot a simple time series data set, but am running into a number of basic issues (one of which I'll ask here). For example, I have a notepad file that starts with:
"x"
"1",2.731
"2",2.562
"3",2.632
"4",2.495
"5",1.978
...and so on...
So R reads it just fine, e.g. myfile=read.table("F:/Documents/myfile.txt",sep=""). However, the values seem to change under a conversion using R's ts function, i.e.
myfile = ts(myfile,start=1,end=120,frequency=1)
plot(myfile, type="o",pch=22,lty=1,pty=2,xlab="Month",ylab="Values",main="My File")
So when plotted, the first value starts at 20+ for some reason, as opposed to 2+. Furthermore, R assumes that the y-axis goes from 1 to 120 (mirroring the x-axis), which is not the right scale (i.e. 0 through 10). In another data set that I did (using integers), it was shifted upward by 1. In any event, I believe the issue is probably about how to properly identify the y-axis.
Any ideas on how to tackle this? Thanks!
I want to analyse angles in movement of animals. I have tracking data that has 10 recordings per second. The data per recording consists of the position (x,y) of the animal, the angle and distance relative to the previous recording and furthermore includes speed and acceleration.
I want to analyse the speed an animal has while making a particular angle; however, since the temporal resolution of my data is so high, each turn consists of a number of minute angles.
I figured there are two possible ways to work around this problem, but I do not know how to achieve either of them in R; help would be greatly appreciated.
The first: Reducing my temporal resolution by a certain factor. However, this brings the disadvantage of losing possibly important parts of the data. Despite this, how would I be able to automatically subsample for example every 3rd or 10th recording of my data set?
The second: By converting straight movement into so called 'flights'; rule based aggregation of steps in approximately the same direction, separated by acute turns (see the figure). A flight between two points ends when the perpendicular distance from the main direction of that flight is larger than x, a value that can be arbitrarily set. Does anyone have any idea how to do that with the xy coordinate positional data that I have?
It sounds like there are three potential things you might want help with: the algorithm, the math, or R syntax.
The algorithm you need may depend on the specifics of your data. For example, how much data do you have? What format is it in? Is it in 2D or 3D? One possibility is to iterate through your data set. With each new point, you need to check all the previous points to see if they fall within your desired column. If the data set is large, however, this might be really slow. Worst case scenario, all the data points are in a single flight segment, meaning you would check the first point the same number of times as you have data points, the second point one less, etc. That means n + (n-1) + (n-2) + ... + 1 = n(n+1)/2 operations, which is O(n^2): the running time could grow quadratically with the size of your data set. Hence, you may need something more sophisticated.
The math to check whether a point is within your desired column of width x is pretty straightforward, although maybe more sophisticated math could help inform a better algorithm. One approach would be to use vector arithmetic. To take an example, suppose you have points A, B, and C. Your goal is to see if B falls in a column of width x around the vector from A to C. To do this, find a vector v orthogonal to the vector from A to C, then look at whether the magnitude of the scalar projection of the vector from A to B onto v is less than x. There is lots of literature available for help with this sort of thing; here is one example.
I think this is where I might start (with a boolean function for an individual point), since it seems like an R function to determine this would be convenient. Then write another function that takes a set of points, calculates the vector v, and calls the first function for each point in the set. Then run some data and see how long it takes.
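A rough R sketch of that idea (the function names, the column threshold, and the toy coordinates are mine, assuming the data come as x/y coordinates in a matrix with one row per recording):
within_column <- function(A, B, C, x) {
  v <- C - A                            # direction of the candidate flight A -> C
  u <- c(-v[2], v[1])                   # a vector orthogonal to A -> C
  u <- u / sqrt(sum(u^2))               # unit normal
  w <- B - A                            # vector from A to B
  abs(sum(w * u)) <= x                  # perpendicular distance from the A -> C line is within x
}

flight_ok <- function(xy, x) {          # xy: matrix of recordings, one row per point
  if (nrow(xy) <= 2) return(TRUE)
  A <- xy[1, ]; C <- xy[nrow(xy), ]
  all(apply(xy[2:(nrow(xy) - 1), , drop = FALSE], 1,
            function(B) within_column(A, B, C, x)))
}

xy <- rbind(c(0, 0), c(0.5, 0.02), c(1, 0))
flight_ok(xy, 0.05)                     # TRUE: the middle point stays inside the column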
I'm afraid I won't be of much help with R syntax, although it is on my list of things I'd like to learn. I checked out the manual for R last night and it had plenty of useful examples. I believe this is very doable, even for an R novice like myself. It might be kind of slow if you have a big data set. However, with something that works, it might also be easier to acquire help from people with more knowledge and experience to optimize it.
Two quick clarifying points in case they are helpful:
The above suggestion is just to start with the data for a single animal, so when I talk about growth of data I'm talking about the average data sample size for a single animal. If that is slow, you'll probably need to fix that first. Then you'll need to potentially analyze/optimize an algorithm for processing multiple animals afterwards.
I'm implicitly assuming that the definition of flight segment is the largest subset of contiguous data points where no "sub" flight segment violates the column rule. That is to say, I think I could come up with an example where a set of points satisfies your rule of falling within a column of width x around the vector to the last point, but if you looked at the column of width x around the vector to the second to last point, one point wouldn't meet the criteria anymore. Depending on how you define the flight segment then (e.g. if you want it to be the largest possible set of points that meet your condition and don't care about what happens inside), you may need something different (e.g. work backwards instead of forwards).
I've got about 100M value/count pairs in a text file on my Linux machine. I'd like to figure out what sort of formula I would use to generate more pairs that follow the same distribution.
From a casual inspection, it looks power law-ish, but I need to be a bit more rigorous than that. Can R do this easily? If so, how? Is there something else that works better?
While a bit costly, you can mimic your sample's distribution exactly (without needing any hypothesis on underlying population distribution) as follows.
You need a file structure that's rapidly searchable for "highest entry with key <= X" -- Sleepycat's Berkeley database has a btree structure for that, for example; SQLite is even easier though maybe not quite as fast (but with an index on the key it should be OK).
Put your data in the form of pairs where the key is the cumulative count up to that point (sorted by increasing value). Call K the highest key.
To generate a random pair that follows exactly the same distribution as the sample, generate a random integer X between 0 and K and look it up in that file structure with the mentioned "highest that's <=" and use the corresponding value.
Not sure how to do all this in R -- in your shoes I'd try a Python/R bridge, do the logic and control in Python and only the statistics in R itself, but, that's a personal choice!
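For what it's worth, if the pairs happen to fit in memory, the same cumulative-count lookup can be sketched directly in R, using cumsum and findInterval in place of an on-disk B-tree (the toy vectors below stand in for your real value/count file):
vals   <- c(1, 2, 5, 10)                        # observed values
counts <- c(40, 30, 20, 10)                     # their counts
keys   <- cumsum(counts)                        # cumulative count up to each value
K      <- sum(counts)                           # highest key

X   <- sample.int(K, 1000, replace = TRUE)      # random integers between 1 and K
sim <- vals[findInterval(X - 1, keys) + 1]      # find which cumulative-count bin each X falls into
table(sim)                                      # proportions roughly match the original counts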
To see whether you have a real power law distribution, make a log-log plot of frequencies and see whether they line up roughly on a straight line. If you do have a straight line, you might want to read this article on the Pareto distribution for more on how to describe your data.
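A minimal sketch of that log-log check in R (the toy vectors stand in for the real value/frequency pairs):
vals  <- c(1, 2, 5, 10, 20)
freqs <- c(400, 180, 60, 25, 11)
plot(vals, freqs, log = "xy", xlab = "value", ylab = "frequency")
# points lying roughly along a straight line are consistent with a power law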
I'm assuming that you're interested in understanding the distribution over your categorical values.
The best way to generate "new" data is to sample from your existing data using R's sample() function. This will give you values which follow the probability distribution indicated by your existing counts.
To give a trivial example, let's assume you had a file of voter data for a small town, where the values are voters' political affiliations, and counts are number of voters:
affils <- as.factor(c('democrat','republican','independent'))
counts <- c(552,431,27)
## Simulate 20 new voters, sampling from affiliation distribution
new.voters <- sample(affils,20, replace=TRUE,prob=counts)
new.counts <- table(new.voters)
In practice, you will probably bring in your 100m rows of values and counts using R's read.csv() function. Assuming you've got a header line labeled "values\t counts", that code might look something like this:
dat <- read.csv('values-counts.txt',sep="\t",colClasses=c('factor','numeric'))
new.dat <- sample(dat$values,100,replace=TRUE,prob=dat$counts)
One caveat: as you may know, R keeps all of its objects in memory, so be sure you've got enough freed up for 100m rows of data (storing character strings as factors will help reduce the footprint).