I need some help understanding this type of code and what happens in it. For instance, take a vector x defined by the integers (8,6,5,4,2,1,9).
The first step of this function is to check the condition that the length of the vector is greater than 1. For x, the condition holds.
The next step is to find the position of the smallest value in the vector, which is 6. But I don't understand what actually happens in the following steps, or why the result has to be combined into a vector.
selsort <- function(x) {
  if(length(x) > 1) {
    mini <- which.min(x)
    c(x[mini], selsort(x[-mini])) #selsort() somewhere in here -> recursion
  } else x
}
In recursion, there are 2 key cases:
Base case - input produces a result directly
Recursive case - input causes the program to call itself again
In your function, the base case is when the length of x is not greater than 1. When this happens, we just return x. Once we reach the base case, the function is not called any more times; all that remains is to backtrack through the previous recursive cases and finish executing those selsort() calls.
The recursive case is when the length is greater than 1. For this, we combine the smallest value in our vector with the result of selsort() on the vector without that smallest value. This continues until we reach the base case. So, we find the smallest value, remove it from the vector, and then repeat with all of the values from the previous run except the one we selected. Once we reach the base case of there being only 1 element left (the largest one), there is no more minimum finding to do, so we just return that last element.
This is called selection sort, because we are specifically selecting 1 element each time (the smallest element). With large data, this is inefficient, but it is a natural way to think about sorting.
There are more efficient sorting algorithms. One nice one that is easy to understand is merge sort: Merge Sort in R
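For comparison, here is a minimal merge sort sketch. It is written in Python rather than R, and the function name and structure are illustrative assumptions, not taken from the linked post. It has the same two cases as above: a base case (length 0 or 1) and a recursive case (sort each half, then merge).

```python
def merge_sort(values):
    """Return a new sorted list; the input list is left unchanged."""
    if len(values) <= 1:               # base case: nothing to sort
        return list(values)
    mid = len(values) // 2
    left = merge_sort(values[:mid])    # recursive case: sort each half
    right = merge_sort(values[mid:])
    merged = []
    i = j = 0
    # Repeatedly take the smaller front element of the two sorted halves
    while i < len(left) and j < len(right):
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])            # one of these two is already empty
    merged.extend(right[j:])
    return merged

print(merge_sort([8, 6, 5, 4, 2, 1, 9]))  # [1, 2, 4, 5, 6, 8, 9]
```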
It puts the smallest number at the first position of the vector, removes this entry from the vector, and recursively repeats this until all entries of the vector are sorted from smallest to largest.
Example
In the first step
x <- x1 <- c(8,6,5,4,2,1,9)
the position of the smallest number in the vector is identified by selsort() with the which.min() function. This number is put at the first position. At the same time, this element is removed from the vector. Therefore in the next step one has
x2 <- c(8,6,5,4,2,9)
c(1,selsort(x2))
Now the algorithm searches for the smallest number in x2, which is 2, puts that one on the front and removes it from the vector, leading to:
x3 <- c(8,6,5,4,9)
c(1,c(2,selsort(x3)))
This is repeated until the length of the vector is equal to one. Then there is nothing left to sort and the last number is returned, which is the largest element of the initial vector.
The assignments x1, x2, x3... are mentioned here only to illustrate the sequence of operation of the code. This is done implicitly in the recursive function which uses only one vector x and reduces it by one entry at each iteration.
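To watch that implicit sequence, the following sketch prints the argument of every recursive call. It is a Python rendering of the R function for illustration only; `selsort_trace` and its `depth` parameter are names made up here.

```python
def selsort_trace(x, depth=0):
    """Recursive selection sort that prints each call, mirroring the R selsort()."""
    print("  " * depth + f"selsort({x})")
    if len(x) <= 1:                      # base case: return the vector as is
        return list(x)
    mini = x.index(min(x))               # like which.min(x): first position of the minimum
    rest = x[:mini] + x[mini + 1:]       # like x[-mini]: everything except that position
    # like c(x[mini], selsort(x[-mini]))
    return [x[mini]] + selsort_trace(rest, depth + 1)

result = selsort_trace([8, 6, 5, 4, 2, 1, 9])
print(result)
```

Each printed line is one level deeper and one element shorter, which corresponds to x1, x2, x3, ... above.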
Hope this helps.
Related
I have a customer who sends electronic payments but doesn't bother to specify which invoices they cover. I'm left guessing which ones, and I would rather not try every single combination manually. I need some sort of pseudo-code to do it that I can then adapt, but I'm not sure I can come up with a good algorithm myself. I'm familiar with php, bash, and python but I can adapt.
I would need an array with the following numbers: [357.15, 223.73, 106.99, 89.96, 312.39, 120.00]. Those are the amounts of the invoices. Then I would need to find a sum of any combination of two or more of those numbers that adds up to 596.57. Once found the program would need to tell me exactly which numbers it used to reach the sum so I can then know which invoices got paid.
This is very similar to the Subset Sum problem and can be solved using a similar approach to the typical brute-force method used for that problem. I have to do this often enough that I keep a simple template of this algorithm handy for when I need it. What is posted below is a slightly modified version (see note 1 below).
This has no restrictions on whether the values are integer or float. The basic idea is to iterate over the list of input values and keep a running list of every subset that sums to less than the target value (since there might be a later value in the inputs that will yield the target). It could be modified to handle negative values as well by removing the rule that only keeps candidate subsets if they sum to less than the target. In that case, you'd keep all subsets, and then search through them at the end.
import copy

def find_subsets(base_values, target):
    possible_matches = [[0, []]]  # [[known_attainable_value, [list, of, components]], [...], ...]
    matches = []  # we'll return ALL subsets that sum to `target`
    for base_value in base_values:
        temp = copy.deepcopy(possible_matches)  # Can't modify in loop, so use a copy
        for possible_match in possible_matches:
            new_val = possible_match[0] + base_value
            if new_val <= target:
                # Build a new list instead of appending in place, so subsets never share state
                new_possible_match = [new_val, possible_match[1] + [base_value]]
                temp.append(new_possible_match)
                if new_val == target:
                    matches.append(new_possible_match[1])
        possible_matches = temp
    return matches

# Usage: find_subsets(list_of_input_values, target_sum)
This is a very inefficient algorithm and it will blow up quickly as the size of the input grows. The Subset Sum problem is NP-Complete, so you are not likely to find a generalized solution that will work in all cases and is efficient.
1: The way lists are being used here is kludgy. If the goal was to simply find any match, the nested lists could be replaced with a dictionary, and we could exit right away once a match is found. But doing that will cause intermediate subsets that sum to the same value to also map to the same dictionary slot, so only one subset with that sum is kept. Since we need to report all matching subsets (because the values represent checks and are presumably not fungible even if the dollar amounts are equal), a dictionary won't work.
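For illustration, the dictionary variant described in this note could look roughly like the sketch below. It assumes positive integer amounts (for example, cents) and returns only one matching subset; `find_any_subset` and `reachable` are names invented for this example.

```python
def find_any_subset(base_values, target):
    """Return one subset of positive integers that sums to target, or None."""
    reachable = {0: []}   # attainable sum -> one subset that produces it
    for value in base_values:
        # Iterate over a snapshot so that each value is used at most once
        for total, subset in list(reachable.items()):
            new_total = total + value
            if new_total > target or new_total in reachable:
                continue  # too big, or this sum already has a subset (the other one is lost)
            reachable[new_total] = subset + [value]
            if new_total == target:
                return reachable[new_total]  # exit as soon as any match is found
    return None

print(find_any_subset([1, 2, 3, 4], 6))  # [1, 2, 3]
```

Note that [2, 4] also sums to 6 but is never reported, which is exactly the limitation described above.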
You can use itertools.combinations(t,r) to list all combinations of r elements in array t.
So we loop on the possible values of r, then on the results of itertools.combinations:
import itertools

def find_sum(t, obj):
    t = [x for x in t if x < obj]  # filter out elements which are too big
    for r in range(1, len(t)+1):  # loop on number of elements
        for subt in itertools.combinations(t, r):  # loop on combinations of r elements
            if sum(subt) == obj:
                return subt
    return None

find_sum([1,2,3,4], 6)
# (2, 4)
find_sum([1,2,3,4], 10)
# (1, 2, 3, 4)
find_sum([1,2,3,4], 11)
# None
find_sum([35715, 22373, 10699, 8996, 31239, 12000], 59657)
# None
Rounding errors:
The code above is meant to be used with integers, rather than floats.
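To use with floats, replace the test sum(subt) == obj with the more forgiving test abs(sum(subt) - obj) < 0.01. The abs() matters: without it, every combination whose sum is smaller than obj would pass the test.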
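A minimal sketch of the float-tolerant version follows. The function name `find_sum_float`, the default tolerance of half a cent, and the restriction to combinations of two or more amounts (as the question asks) are assumptions made for this example.

```python
import itertools
import math

def find_sum_float(amounts, target, tol=0.005):
    """Return the first combination of two or more amounts within tol of target, else None."""
    for r in range(2, len(amounts) + 1):
        for combo in itertools.combinations(amounts, r):
            # abs_tol compares |sum - target| against tol, so sums that fall short do not match
            if math.isclose(sum(combo), target, abs_tol=tol):
                return combo
    return None

invoices = [357.15, 223.73, 106.99, 89.96, 312.39, 120.00]
print(find_sum_float(invoices, 580.88))        # (357.15, 223.73)
print(find_sum_float(invoices, 596.57))        # None
print(find_sum_float([0.1, 0.2, 0.4], 0.3))    # (0.1, 0.2), although 0.1 + 0.2 != 0.3 exactly
```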
Relevant documentation:
itertools.combinations
I've tried a couple ways of doing this problem but am having trouble with how to write it. I think I did the first three steps correctly, but now I have to fill the vector z with numbers from y that are divisible by four, not divisible by three, and have an odd number of digits. I know that I'm using the print function in the wrong way, I'm just at a loss on what else to use ...
This is different from that other question because I'm not using a while loop.
#Step 1: Generate 1,000,000 random, uniformly distributed numbers between 0
#and 1,000,000,000, and name as a vector x. With a seed of 1.
set.seed(1)
x=runif(1000000, min=0, max=1000000000)
#Step 2: Generate a rounded version of x with the name y
y=round(x,digits=0)
#Step 3: Empty vector named z
z=vector("numeric",length=0)
#Step 4: Create for loop that populates z vector with the numbers from y that are divisible by
#4, not divisible by 3, with an odd number of digits.
for(i in y) {
  if(i%%4==0 && i%%3!=0 && nchar(i,type="chars",allowNA=FALSE,keepNA=NA)%%2!=0){
    print(z,i)
  }
}
NOTE: As per @BenBolker's comment, a loop is an inefficient way to solve your problem here. Generally, in R, try to avoid loops where possible to maximise the efficiency of your code. @SymbolixAU has provided an example of doing so here in the comments. Having said that, to help you learn the ins and outs of loops and vectors, here's a solution which only requires a change to one line of your code:
You've got the vector created before the loop, which is a good start. Now, inside your loop, you need to populate that vector. To do so, you've currently got print(z,i), which only prints z and never changes it. What you need to do is change the vector itself:
z <- c( z, i )
Should work for you (just replace that print line in your loop).
What's happening here is that we're taking the existing z vector, binding i to the end of it, and making that new vector z again. So every time a value is added, the vector gets a little longer, such that you'll end up with a complete vector.
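For readers comparing languages, here is the same idea as a Python sketch: once as a loop that grows the result one element at a time, and once as a single loop-free filter. The helper name `keep` and the smaller sample size are assumptions for this example, and Python's random numbers will not match R's set.seed(1).

```python
import random

def keep(n):
    """True if n is divisible by 4, not divisible by 3, and has an odd number of digits."""
    return n % 4 == 0 and n % 3 != 0 and len(str(n)) % 2 == 1

random.seed(1)
y = [random.randint(0, 1_000_000_000) for _ in range(1000)]  # smaller sample than the R code

# Loop version: grow the result one element at a time (the analogue of z <- c(z, i))
z_loop = []
for n in y:
    if keep(n):
        z_loop.append(n)

# Loop-free version: filter in a single expression
z = [n for n in y if keep(n)]

print(z == z_loop)  # True
```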
Where you have print, put this instead:
z <- append(z, i)
Say I have a vector defined as indexUsed = rep(NA, 10);
I want to give its ith element a value in each iteration.
for(i in 1:10){
  indexUsed[i] = largestGradient(X, y, indexUsed[is.na(indexUsed)], score)
}
As you can see, I want to use indexUsed[1:(i-1)] to calculate the ith element, but for the first element I want NULL or some other special value there, so that my function knows the input is empty (it then handles assigning the first element differently from the later steps).
I don't know whether this is a good way to do it. How would you usually do this?
I don't have a better way of doing this than with a for loop, but would love to see other people's responses. However, it does seem to me that your code should read
indexUsed[i] <- largestGradient(X, y, indexUsed[!is.na(indexUsed)], score)
For i=1, your indexUsed[!is.na(indexUsed)] will be empty, and that should be the base case in your function. For every other iteration, it will retrieve elements 1 through i-1.
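The pattern itself (each step receives everything chosen so far, and an empty collection signals the first step) can be sketched in Python as follows. `largest_gradient` here is a hypothetical stand-in that simply picks the largest unused candidate; it is not the asker's function.

```python
def largest_gradient(candidates, used):
    """Hypothetical stand-in: return the largest candidate not already used."""
    remaining = [c for c in candidates if c not in used]
    return max(remaining)

candidates = [3, 9, 1, 7, 5]
index_used = []   # empty on the first iteration: this is the base case
for _ in range(3):
    # Each call sees only the elements selected in earlier iterations
    index_used.append(largest_gradient(candidates, index_used))

print(index_used)  # [9, 7, 5]
```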
I have three data sources:
types<-c(1,3,3)
places<-list(c(1,2,3),1,c(2,3))
lookup.counts<-as.data.frame(matrix(runif(9,min=0,max=10),nrow=3,ncol=3))
assigned.places<-rep.int(0,length(types))
The numbers in the "types" vector tell me what 'type' a given observation is. The vectors in the places list tell me which places the observation can be found in (some observations are found in only one place, others in all places). By definition there is one entry in types and one list in places for each observation. Lookup.counts tells me how many observations of each type are located in each place (generated from another data source).
I want to randomly assign each observation to a place based on a probability generated from lookup.counts. Using for loops, it looks something like:
for (i in 1:length(types)){
  row<-types[i]
  columns<-places[[i]]
  this.obs<-lookup.counts[row,columns] #the counts of this type in each place
  total<-sum(this.obs)
  this.obs<-this.obs/total #the share of observations of this type in these places
  pick<-runif(1,min=0,max=1)
  #the following should really be a 'while' loop, but regardless it needs help
  for(j in 1:length(this.obs[])){
    if(this.obs[j] > pick){
      #pick is less than this county so assign
      pick<- 100 #just a way of making sure an observation doesn't get assigned twice
      assigned.places[i]<-colnames(lookup.counts)[j]
    }else{
      #pick is greater, move to the next category
      pick<- pick-this.obs[j]
    }
  }
}
I have been trying to vectorize this somehow, but am getting hung up on the variable length of 'places' and of 'this.obs'
In practice, of course, the lookup.counts table is quite a bit bigger (500 x 40) and I have some 900K observations with places lists of length 1 through length 39.
To vectorize the inner loop, you can use sample or sample.int to choose from several alternatives with prescribed probabilities. Unless I read your code incorrectly, you want something like this:
assigned.places[i] <- sample(colnames(this.obs), 1, prob = this.obs)
I'm a bit surprised that you're using colnames(lookup.counts) instead. Shouldn't this be subset by columns as well? It seems that either I missed something, or there is a bug in your code.
The different lengths of your lists are a severe obstacle to vectorizing your outer loop. Perhaps you could use the Matrix package to store that information as sparse matrices. Then you could simply multiply the probabilities by that vector to exclude those columns which are not in the places list of a given observation. But as you'd probably still use apply for the above sampling code, you might as well keep the list and use some form of apply to iterate over that.
The overall result might look somewhat like this:
assigned.places <- colnames(lookup.counts)[
  apply(cbind(types, places), 1, function(x) {
    sample(x[[2]], 1, prob=lookup.counts[x[[1]],x[[2]]])
  })
]
The use of cbind and apply isn't particularly beautiful, but seems to work. Each x is a list of two items, x[[1]] being the type and x[[2]] being the corresponding places. We use these to index lookup.counts just as you did. Then we use the found counts as relative probabilities when choosing the index of one of the columns we used in the subscript. Only after all these numbers have been assembled into a single vector by apply will the indices be turned into names based on colnames. One caveat: sample() treats a single number n as 1:n, so an observation whose places entry is a single index other than 1 would need special handling.
You can check whether things are faster if you don't cbind stuff together, but instead iterate over the indices only:
assigned.places <- colnames(lookup.counts)[
  sapply(1:length(types), function(i) {
    sample(places[[i]], 1, prob=lookup.counts[types[i],places[[i]]])
  })
]
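The per-observation weighted draw can also be sketched in Python, where random.choices accepts unnormalised weights directly. The counts below are made-up numbers, indices are 0-based, and `assign_place` is a name invented for this example.

```python
import random

types = [0, 2, 2]                     # row index per observation (0-based here)
places = [[0, 1, 2], [0], [1, 2]]     # allowed column indices per observation
lookup_counts = [[4.0, 1.0, 5.0],
                 [2.0, 2.0, 6.0],
                 [0.5, 3.0, 1.0]]     # made-up counts of each type in each place
col_names = ["V1", "V2", "V3"]

def assign_place(obs_type, obs_places):
    """Draw one allowed place, with probability proportional to its count."""
    weights = [lookup_counts[obs_type][p] for p in obs_places]
    return random.choices(obs_places, weights=weights, k=1)[0]

random.seed(1)
assigned = [col_names[assign_place(t, p)] for t, p in zip(types, places)]
print(assigned)
```

An observation with a single allowed place always gets that place, so no special case is needed for lists of length 1.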
This appears to work as well:
# More convenient if lookup.counts is a matrix.
lookup.counts<-matrix(runif(9,min=0,max=10),nrow=3,ncol=3)
colnames(lookup.counts)<-paste0('V',1:ncol(lookup.counts))
# A function that does what the for loop does for each i
test<-function(i) {
  this.places<-colnames(lookup.counts)[places[[i]]]
  this.obs<-lookup.counts[types[i],this.places]
  sample(this.places,size=1,prob=this.obs)
}
# Applies the function for all i
sapply(1:length(types),test)
I have an assignment using R and have a small problem. In the assignment, several matrices have to be generated with a random number of rows and later used for various calculations. Everything works perfectly unless the number of rows is 1.
In the calculations I use nrow(matrix) in different ways, for example if (i <= nrow(matrix) ) {action}, and also expressions like matrix[,4] and so on.
So when the number of rows is 1 (I know it is actually a vector), R gives errors, presumably because nrow(1-dimensional matrix)=NULL. Is there a simple way to deal with this? Otherwise the whole code probably has to be rewritten, but I'm very short on time :(
It is not that single-row/col matrices in R have ncol/nrow set to NULL. In R, everything is a 1D vector that behaves like a matrix (i.e. prints as a matrix, accepts matrix indexing, etc.) when it has a dim attribute set. It seems otherwise because simply indexing a matrix down to a single row or column drops dim and leaves the data in its default (1D vector) state.
Thus you can accomplish your goal either by directly recreating the dim attribute of a vector (say it is called x):
dim(x)<-c(length(x),1)
x #Now a single column matrix
dim(x)<-c(1,length(x))
x #Now a single row matrix
OR by preventing [] operator from dropping dim by adding drop=FALSE argument:
x<-matrix(1:12,3,4)
x #OK, matrix
x[,3] #Boo, vector
x[,3,drop=FALSE] #Matrixicity saved!
Let's call your vector x. Try using matrix(x) or t(matrix(x)) to convert it into a proper (2D) matrix.