I would like to solve a fairly big optimization problem where time matters, but I got stuck trying to navigate the vast number of R packages, so I would like to ask the community directly about this problem.
I want to minimize a function:
F=(x-y)^2
where y is a given, predefined vector of 8000 values.
So I'm searching for the 8000 values of x.
I've got a matrix A (which is basically a dummy-variable matrix) with nrow = 8 and ncol = 8000.
I also have a vector b, with 8 given values.
So, I want to solve the following problem:
min (x - y)^2
s.t.
A*x = b
Theoretically I understand everything, but somehow I fail to express F in any of the packages that allow equality constraints.
Also (since I have no idea what the processing time will be), I would like to ask what you would do if:
F= abs(x-y)
because if the minimization of the quadratic function takes too long, this second option would also satisfy me.
The data is confidential, but I'll send it privately (and slightly altered) if it's necessary for the solution.
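For the quadratic case, here is a minimal sketch of how this could be set up with the quadprog package, using made-up toy data. solve.QP() minimizes (1/2)x'Dx - d'x subject to t(Amat) %*% x >= bvec, with the first meq constraints treated as equalities, so minimizing (x - y)^2 corresponds to Dmat = I and dvec = y:

library(quadprog)  # assumed to be installed

set.seed(1)
n <- 20                                   # stands in for 8000
m <- 4                                    # stands in for 8
y <- rnorm(n)                             # the given "plan" vector
A <- matrix(rbinom(m * n, 1, 0.5), m, n)  # stand-in for the dummy matrix
b <- rnorm(m)                             # the given targets

sol <- solve.QP(Dmat = diag(n), dvec = y,
                Amat = t(A), bvec = b, meq = m)
x <- sol$solution
all.equal(as.vector(A %*% x), b)          # equality constraints hold

Note that at full size Dmat = diag(8000) would be a 512 MB dense matrix; since D is the identity here, the same minimizer is also available in closed form as x = y + t(A) %*% solve(A %*% t(A), b - A %*% y), which avoids building it. For the abs(x - y) variant, the usual trick is to split x - y into nonnegative parts and hand the resulting linear program to a solver such as lpSolve.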
Edit nr.1:
OK, I'll be more specific this time.
I've got 2 years of data (that is the 8000 measurements; each year contains 4000 measurements).
Each year has q1, q2, q3, q4, which happened somehow in the past (but will be specified as an optimum in the future, to achieve some goals).
So this is my b vector, the criteria that the optimization has to meet.
Made-up numbers:
b<-c(20,30,40,50,60,70,80,90)
I have a matrix A, which is a binary matrix indicating where we are in time: q1, q2, etc.
Let's say that one quarter of the year is 1 day long, so:
(there are 7 zeros in a vector, because we are talking about 2 years here, and only one quarter)
a<-c(1,0,0,0,0,0,0,0)
u<-c(0,1,0,0,0,0,0,0)
c<-c(1,0,1,0,0,0,0,0)
d<-c(0,0,0,1,0,0,0,0)
From this point another year comes in, with another q1; that is why the binary indicator won't jump back to the first position.
e<-c(0,0,0,0,1,0,0,0)
f<-c(0,0,0,0,0,1,0,0)
g<-c(0,0,0,0,0,0,1,0)
h<-c(0,0,0,0,0,0,0,1)
A<-cbind(a,u,c,d,e,f,g,h)
This is a somewhat misleading way to represent the data: it may trick you because the length and the width of the matrix are the same here, but remember that in the original data everything is fine for matrix multiplication.
The width of A and the length of x is 8000.
There is a planned way how things should go in each quarter; that is the y, which is given.
Made-up numbers:
y<-c(10,11,12,13,14,16,17,18)
So basically I want to stick to the plan as much as I can while achieving criteria b; that is why I want to minimize the differences between the planned values and the x values:
min F = (Ax - y)^2
s.t.: A*x = b
Hope it's clearer. I reduced the dimension of the problem, so presented this way it may seem infeasible
(it's dumb, I know :)
Looks like there is nothing to optimize. E.g. with your data set:
> b<-c(20,30,40,50,60,70,80,90)
> a<-c(1,0,0,0,0,0,0,0)
> u<-c(0,1,0,0,0,0,0,0)
> c<-c(1,0,1,0,0,0,0,0)
> d<-c(0,0,0,1,0,0,0,0)
>
> e<-c(0,0,0,0,1,0,0,0)
> f<-c(0,0,0,0,0,1,0,0)
> g<-c(0,0,0,0,0,0,1,0)
> h<-c(0,0,0,0,0,0,0,1)
>
> A<-cbind(a,u,c,d,e,f,g,h)
> x <- solve(A,b)
> x
a u c d e f g h
-20 30 40 50 60 70 80 90
It would be more interesting if there were some degrees of freedom left to play with x and make it as close to y as possible.
Background:
On this combinatorics question, the issue is how to determine the sample space: the number of ways 8 different soccer teams can be paired up for the next round of competition. Two different answers have been advanced for that part of the problem: 28 (see the comments to the OP) and 105 (see the edit within the OP and the answer).
I'd like to do this manually to try to home in on the mistake in whichever answer is incorrect.
What I have tried:
teams = 1:8
names(teams) = c("RM", "BCN", "SEV", "JUV", "ROM", "MC", "LIV", "BYN")
split(sample(teams), rep(1:(length(teams)/2), each=2))
Unfortunately, the output is a list, and I wanted a vector so as to be able to run something like:
unique(...,MARGIN=2)
Is there a way of doing this in an elegant manner?
After a now-deleted answer (thank you), I would go with
a <- replicate(1e5, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2))))
to simulate 100,000 random samples, and later run
unique(a, MARGIN = 2).
But how can I account for the fact that the order of the 4 pairings of opponents doesn't matter, and that LIV-BYN and BYN-LIV, for example, are the same pairing (field advantage notwithstanding)?
> u = ncol(unique(replicate(1e6, unlist(split(sample(teams), rep(1:(length(teams)/2), each=2)))), MARGIN = 2))
> u / (factorial(4) * 2^4)
[1] 105
The idea of unlist is from @Song Zhengyi, and if his answer is un-deleted, I'll accept it. The complete answer is in the lines above.
u needs to be divided by 4! because
BCN-RM, BYN-SEV, JUV-ROM, LIV-MC
is exactly the same as
LIV-MC, BCN-RM, BYN-SEV, JUV-ROM
or
BCN-RM, LIV-MC, BYN-SEV, JUV-ROM
etc.
The term 2^4 avoids over-counting, since in every possible unique draw each of the pairings can be flipped without loss (discarding field advantage): BCN-RM is the same as RM-BCN, and there are 4 pairs in each draw.
If field advantage is a consideration (real life)...
> u/factorial(4)
[1] 1680
we end up with 1,680 possible draws.
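For what it's worth, both counts also follow directly from the factorials, without any simulation: u is simply the number of orderings of 8 teams, 8! = 40320.
> factorial(8) / (factorial(4) * 2^4)
[1] 105
> factorial(8) / factorial(4)
[1] 1680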
As a background, I'm a computer programmer and I'm working on a software library that allows a computer to quickly search through all dates to find a set of dates that satisfies a criteria. For example:
I want a list of every possible time that has ever occurred on a Friday or a Saturday in April or May during the first week of the month.
My library uses numerical sets to efficiently represent ranges of dates that satisfy a criteria.
I've been thinking about ways to improve the performance of some parts of the app, and I think that by combining sets and some geometry I can really improve my results. However, my geometry is a bit rusty, and I was hoping you might be able to help.
Here's my thought:
Certain elements of time can be represented as a circular dial. For example, minutes can be positioned on a clock with values between 0...59. We could store valid ranges as a list of arcs. For example, if we wanted all times ending in :05-:10, we could store [5, 10]. If we wanted all times ending in :45-:59 or :00-:15, we could store [45, 15]. Notice how this last arc "wraps around" the dial. Here's a mockup showing different ranges intersecting on a dial.
My question is this:
Given a set of whole numbers between N...M arranged into a circle, and
given Arc1, represented by [A, B], and Arc2, represented by [C, D], where A, B, C, and D are all within the range N...M,
How do I determine:
A. Whether the arcs intersect.
B. If they do, what their intersection is.
C. If they do, what their union is.
Thank you so much for your help. If you can't help directly but can point me in the right direction, that would be great.
Thanks!
A simple and safe approach is to split the intervals that straddle 0. Then you perform pairwise interval intersection/union (for instance, if A < D and C < B, then the intersection is [max(A,C), min(B,D)]), and merge the pieces if they meet at 0.
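A minimal sketch of that approach, in R to match the rest of this page (the function names are made up). Arcs are pairs [lo, hi] on a dial of size M holding the values 0..M-1; a wrapping arc is split into two ordinary intervals before intersecting:

# Split a possibly wrapping arc on a dial of size M into ordinary intervals.
split_arc <- function(arc, M) {
  if (arc[1] <= arc[2]) list(arc) else list(c(arc[1], M - 1), c(0, arc[2]))
}

# Intersect two arcs piecewise; returns a list of non-wrapping intervals.
# (Pieces that meet at 0 could be merged back into one wrapping arc.)
intersect_arcs <- function(arc1, arc2, M) {
  out <- list()
  for (p in split_arc(arc1, M)) for (q in split_arc(arc2, M)) {
    lo <- max(p[1], q[1]); hi <- min(p[2], q[2])
    if (lo <= hi) out <- c(out, list(c(lo, hi)))
  }
  out
}

intersect_arcs(c(45, 15), c(5, 50), 60)  # the minutes example: [45,50] and [5,15]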
It seems the primitive operation to implement would be something like "is the number X contained in the arc [A, B]?" (a sketch of this primitive follows the list below). Once you have that, you could implement an [A,B]/[C,D] arc-intersection predicate along these lines:
Arc intersection means exactly that at least one of the following conditions is met:
C is contained in [A,B]
D is contained in [A,B]
A is contained in [C,D]
B is contained in [C,D]
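As a minimal sketch of the containment primitive (again in R; the helper name is made up), one can measure both X and B as clockwise offsets from A, which handles wrap-around without a special case:

# Is X on the arc from A to B (moving in increasing, wrapping order)
# on a dial of size M (e.g. M = 60 for minutes)?
arc_contains <- function(X, A, B, M) {
  ((X - A) %% M) <= ((B - A) %% M)
}

arc_contains(50, 45, 15, 60)  # TRUE: 50 lies on the wrapping arc [45, 15]
arc_contains(20, 45, 15, 60)  # FALSE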
One way to implement this contained-in-arc test without any branches is with some trigonometry and the vector cross product. I'm not sure it would be faster (the math/branches performance tradeoff is entirely empirical), but it might be worth a try.
Denote Xa = sin(A/N * 2PI), Ya = cos(A/N * 2PI), and similarly Xb, Yb for B, Xc, Yc for C, etc.
C is contained in [A,B] is equivalent to:
Xa * Yc - Ya * Xc > 0
AND
Xc * Yb - Yc * Xb > 0
You can complete the other 3 conditions in an identical manner.
Hope this turns out to be useful.
In the R programming language...
Bottleneck in my code:
a <- a[b]
where:
a and b are vectors of length 90 million,
a is a logical vector,
b is a permutation of the indices of a.
This operation is slow: it takes ~1.5-2.0 seconds.
I thought straightforward indexing would be much faster, even for large vectors.
Am I simply stuck? Or is there a way to speed this up?
Context:
P is a large matrix (10k rows, 5k columns).
rows = names, columns = features, values = real numbers.
Problem: Given a subset of names, I need to obtain matrix Q, where:
Each column of Q is sorted (independently of the other columns of Q).
The values in a column of Q come from the corresponding column of P and are only those from the rows of P which are in the given subset of names.
Here is a naive implementation:
Psub <- P[names,]
Q <- sapply( Psub , sort )
But I am given 10,000 distinct subsets of names (each subset covers 20% to 90% of the total), and taking the subset and sorting it each time is incredibly slow.
Instead, I can pre-compute the order vector:
b <- sapply( P , order )
b <- convert_to_linear_index( as.data.frame(b) , dim(P) )
# my own function.
# Now b is a vector of length nrow(P) * ncol(P)
a <- rownames(P) %in% myNames
a <- rep(a , ncol(P) )
a <- a[b]
a <- as.matrix(a , nrow = length(myNames) )
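To make the intended pipeline concrete, here is a self-contained toy version of the same precomputation idea (tiny made-up dimensions; convert_to_linear_index is my own function, so the linear indices are built inline here):

set.seed(42)
P <- data.frame(matrix(rnorm(20), nrow = 5))
rownames(P) <- paste0("name", 1:5)
myNames <- c("name1", "name3", "name4")

# Precompute once: column-wise orders, as linear indices into flattened P
ord <- sapply(P, order)                          # 5 x 4 matrix of orders
b   <- as.vector(ord + (col(ord) - 1) * nrow(P)) # linear indices

# Per subset: permute the row mask, then pick out the sorted subset values
a <- rownames(P) %in% myNames
a <- rep(a, ncol(P))[b]                          # the bottleneck step: a <- a[b]
Q <- matrix(unlist(P)[b][a], nrow = length(myNames))
Q                                                # each column sorted, subset rows only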
I don't see this getting much faster than that. You can try writing an optimized C function to do exactly this, which might cut the time in half or so (and that's optimistic: vectorized R operations like this don't have much overhead), but not much more than that.
You've got approx 10^8 values to go through. Each time through the internal loop, it needs to increment the iterator, fetch the index b[i] from memory, look up a[b[i]], and then store that value into newa[i]. I'm not a compiler/assembly expert by a long shot, but this sounds like on the order of 5-10 instructions, which means you're looking at roughly a billion instructions total; at a ~3 GHz clock that alone is a substantial fraction of a second, so there's a clock-rate limit on how fast this can go.
Also, R stores logical values as 32-bit ints, so the array a will take up about 400 MB, which doesn't fit into cache. So if b is a more or less random permutation, you're going to miss the cache regularly (on most lookups into a, in fact). Again, I'm not an expert, but I would think the cache misses here are the bottleneck, and if that's the case, optimized C won't help much.
Aside from writing it in C, the other thing to do is determine whether there are any assumptions you can make that would let you avoid going through the whole array. For example, if you know most of the indices will not change, and you can figure out which ones do change, you might be able to make it go faster.
On edit, here are some numbers. I have an AMD machine with a clock speed of 2.8 GHz. It takes 3.4 seconds with a random permutation (i.e. lots of cache misses) and 0.7 seconds with either 1:n or n:1 (i.e. very few cache misses), which breaks down into 0.6 seconds of execution time and 0.1 of system time, presumably to allocate the new array. So it does appear that cache misses are the thing. Optimized C code might shave something like 0.2 or 0.3 seconds off that base time, but if the permutation is random, it won't make much difference.
> x<-sample(c(T,F),90*10**6,T)
> prm<-sample(90*10**6)
> prm1<-1:length(prm)
> prm2<-rev(prm1)
> system.time(x<-x[prm])
user system elapsed
3.317 0.116 3.436
> system.time(x<-x[prm1])
user system elapsed
0.593 0.140 0.734
> system.time(x<-x[prm2])
user system elapsed
0.631 0.112 0.743
>
I want to be able to completely detach a subset (created by tapply) of a dataframe from its parent dataframe. Basically, I want R to forget the existing relation and consider the subset dataframe in its own right.
Edit: following the proposed solution in the comments, I find it does not work for my data. The reason might be that my real dataset is a plm.dim object with an assigned index. I tried this at home on the example dataset and it worked fine. However, once again, in my real data the problem is not solved.
Here's the output of my actual data (original 37 firms)
sum(tapply(p.data$abs_pb_t,p.data$Rfirm,sum)==0)
[1] 7
s.data <- droplevels(p.data[tapply(p.data$abs_pb_t,p.data$ID,sum)!=0,])
sum(tapply(s.data$abs_pb_t,s.data$Rfirm,sum)==0)
[1] 8
Not only is the problem not solved; for some reason I get an extra count of a zero variable, while I explicitly ask to keep only the ones that differ from zero.
Unfortunately, I cannot recreate the same problem with a simple example. For that example, as said, droplevels() works just fine.
A simple reproducible example:
library(plm)
dad <- cbind(as.data.frame(matrix(1:40, 8, 5)), factors = c("q","w","e","r"), year = c("1991","1992","1993","1994"))
dad<-plm.data(dad,index=c("factors","year"))
kid<-dad[tapply(dad$V5,dad$factors,sum)<=70,]
tapply(kid$V1,kid$factors,mean)
kid<-droplevels(dad[tapply(dad$V5,dad$factors,sum)<=70,])
tapply(kid$V1,kid$factors,mean)
So I create a dad and a kid dataframe based on some tapply condition (I'm sure this extends more generally).
The result of the tapply on the kid is the following:
e q r w
7 NA 8 NA
Clearly R has not forgotten the dad, and it adds that two factors are NA. In itself this is not much of a problem, but in my real dataset, with many more variables and subsetting to do, I'd like a cleaner cut so that searching through the kid(s) is easier. In other words, I don't want the initial factors q, w, e, r to be remembered. The desired output would thus be:
e r
7 8
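For reference, a minimal sketch (with made-up data) of the behaviour I see for a plain data.frame, where droplevels() does exactly what I want:

df  <- data.frame(f = factor(c("q", "w", "e", "r")), v = 1:4)
sub <- droplevels(df[df$v > 2, ])  # keep only the "e" and "r" rows
levels(sub$f)                      # "e" "r" -- the unused levels are gone
tapply(sub$v, sub$f, mean)         # only e and r appear in the result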
So, can anyone think of a reason why what works perfectly in a small data.frame would work differently in a larger dataframe (for p.data: N = 592, T = 16 and n = 37)? I find that when I run 2 identical tapply functions, one on s.data and one on p.data, all values are different. So not only have the zeros not disappeared; literally every sum has changed in s.data, which should not be the case. Maybe that gives a clue as to where I go wrong...
And potentially it could solve the mystery of the factors that refuse to drop as well.
Thanks
Simon
I'm working on a dataset that consists of ~10^6 values clustered into a variable number of bins. In the course of my analysis, I am trying to randomize my clustering while keeping the bin sizes constant. As a toy example (in pseudocode), this would look something like this:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
for (rand in 1:no.of.randomizations) {
rand.data <- partition.sample(seq(1,15), partitions=sizes, replace=F)
}
So, I am looking for a function like "partition.sample" that will take a vector (like seq(1,15)) and randomly sample from it, returning a list with the data partitioned into the bin sizes already given by "sizes".
I've been trying to write such a function myself, since the task seems not to be so hard. However, partitioning a vector into given bin sizes looks like it would be a lot faster and more efficient if done "under the hood", meaning probably not in native R. So I wonder whether I have just missed the name of the appropriate function, or whether someone could point me to a smart solution that is already around :-)
Your help & time are very much appreciated! :-)
Best,
Lymond
UPDATE:
By "no.of.randomizations" I mean the actual number of times I run through the whole "randomization loop". This will, later on, obviously include more steps than just the actual sampling.
Moreover, I would also be interested in a trick to do the above for sampling without replacement.
Thanks in advance, your help is very much appreciated!
Revised: This should be fairly efficient. Its complexity should lie primarily in the permutation step:
# A single step:
x <- sample( unlist(data))
list( one=x[1:4], two=x[5:8], three=x[9], four=x[10:12], five=x[13:15])
As mentioned above, "no.of.randomizations" may be the number of repeated applications of this process, in which case you may want to wrap replicate around it (simplify = FALSE keeps the result as a plain list of lists):
replic <- replicate(n = 4, {
  x <- sample(unlist(data))
  list(x[1:4], x[5:8], x[9], x[10:12], x[13:15])
}, simplify = FALSE)
After some more thinking and googling, I have come up with a feasible solution. However, I am still not convinced that this is the fastest and most efficient way to go.
In principle, I can generate one long vector containing a unique permutation of "data" and then split it into a list of vectors of lengths "sizes", by going via a factor argument supplied to split. For this, I need an additional ID scheme for my different groups of "data", which I happen to have in my case.
It becomes clearer when viewed as code:
data <- list(c(1,5,6,3), c(2,4,7,8), c(9), c(10,11,15), c(12,13,14));
sizes <- lapply(data, length);
So far, everything is as above.
names <- c("set1", "set2", "set3", "set4", "set5");
In my case, I am lucky enough to have "names" already provided with the data. Otherwise, I would have to construct them as (e.g.)
names <- seq(1, length(data));
This "names" vector can then be expanded according to "sizes" using rep:
cut.by <- rep(names, times = sizes);
[1] 1 1 1 1 2 2 2 2 3 4 4 4 5
[14] 5 5
This new vector "cut.by" can then be provided as an argument to split():
rand.data <- split(sample(1:15, 15), cut.by)
$`1`
[1] 8 9 14 4
$`2`
[1] 10 2 15 13
$`3`
[1] 12
$`4`
[1] 11 3 5
$`5`
[1] 7 6 1
This does the job I was looking for: it samples from the background "1:15" and splits the result into vectors of lengths "sizes" through the vector "cut.by".
However, I am still not happy about having to go via an additional (possibly long) vector to indicate the split positions, such as "cut.by" in the code above. This definitely works, but for very long data vectors it could become quite slow, I guess.
Thank you anyway for the answers and pointers provided! Your help is very much appreciated :-)