I am new to this forum. I guess something like this has been asked before, but I am not really sure whether those questions cover what I want.
I have a sequence like this,
1 2 3 4 5 8 9 10 12 14 15 17 18 19
What I wish to do is get all the sets of numbers which form a series, i.e. every number belonging to the set should have a constant difference from the previous element, and the set should contain at least 3 elements.
For example, I can see that (1,2,3,4,5) forms one such series, in which numbers appear at an interval of 1, and the total size of this set is 5, which satisfies the minimum threshold criterion.
(1,3,5) forms another such pattern, in which the numbers appear at an interval of 2.
(8,10,12,14) forms another such pattern with an interval of 2. So, as you can see, the interval of repetition can be anything.
Also, for a particular set, I want only its maximal version. I don't want (8,10,12) as output (although it satisfies the minimum threshold of 3 and the constant difference); I want only the one of maximal length, i.e. (8,10,12,14).
Similarly, for (1,2,3,4,5), I don't want (1,2,3) or (2,3,4,5) as output, only the one of maximal length, i.e. (1,2,3,4,5).
How can I do this in R?
Edit: That is, I want any set which forms a basic AP (arithmetic progression) series with any difference; however, the series should contain at least 3 elements and it should be maximal.
Edit2: I have tried using rle and acf in R, but that doesn't entirely solve my problem.
Edit3: When I used acf, it basically gave me the difference with the maximum peak, which I could have used. However, I want all the possible differences. Also, rle does something quite different: it gives the longest continuous run of identical numbers, and there are no such runs in my case.
If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.
In the general case of basically any subset of your data which form such a sequence, such as the 8,10,12,14 case you cite, your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j] +1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithms.
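A brute-force version of that forward-looking search can be sketched as follows. This is only a sketch, not a tuned algorithm: the function name maximal_aps and its min_size argument are made up for illustration, and it assumes the values are integers, because it relies on exact matching with %in%. For every pair of values it takes their difference d, skips the pair if the progression could be extended backwards (so it would not be maximal), and otherwise extends it forwards as far as possible.

```r
# Hypothetical helper: all maximal arithmetic progressions in x
# (any positive difference, elements need not be adjacent).
maximal_aps <- function(x, min_size = 3) {
  x <- sort(unique(x))
  n <- length(x)
  out <- list()
  if (n < min_size) return(out)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- x[j] - x[i]
      # If x[i] - d is present, this progression starts earlier: not maximal
      if ((x[i] - d) %in% x) next
      ap <- x[i]
      nxt <- x[i] + d
      # Extend forwards while the next term is present
      while (nxt %in% x) {
        ap <- c(ap, nxt)
        nxt <- nxt + d
      }
      if (length(ap) >= min_size) out[[length(out) + 1]] <- ap
    }
  }
  out
}

x <- c(1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19)
res <- maximal_aps(x)
```

For the vector in the question, res contains (1,2,3,4,5), (1,3,5) and (8,10,12,14) among others, but not (8,10,12) or (2,3,4,5). The nested loops make this quadratic in the number of distinct values, so it is only practical for fairly short inputs.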
Here's a potential solution - albeit a very ugly, sloppy one:
##
arithSeq <- function(x=nSeq, minSize=4){
  ## run-length encode the successive differences of x
  Runs <- rle(diff(x,lag=1))
  ##
  rLens <- Runs[[1]]
  rVals <- Runs[[2]]
  pStart <- c(
    rep(1,rLens[1]),
    rep(cumsum(1+rLens[-length(rLens)]),times=rLens[-1])
  )
  pEnd <- pStart + c(
    rep(rLens[1]-1, rLens[1]),
    rep(rLens[-1],times=rLens[-1])
  )
  pGrp <- rep(1:length(rLens),times=rLens)
  pLen <- rep(rLens, times=rLens)
  ## one row per difference: row i describes x[i+1] - x[i]
  dAll <- data.frame(
    pStart=pStart,
    pEnd=pEnd,
    pGrp=pGrp,
    pLen=pLen,
    runVal=rep(rVals,rLens)
  )
  ## a run of k equal differences spans k + 1 elements
  dSub <- subset(dAll, pLen >= minSize - 1)
  ##
  uVals <- unique(dSub$runVal)
  ## for each distinct difference, keep only its longest run(s)
  maxSub <- subset(dSub, runVal==uVals[1])
  maxLen <- max(maxSub$pLen)
  maxSub <- subset(maxSub, pLen==maxLen)
  ##
  if(length(uVals) > 1){
    for(i in 2:length(uVals)){
      iSub <- subset(dSub, runVal==uVals[i])
      iMaxLen <- max(iSub$pLen)
      iSub <- subset(iSub, pLen==iMaxLen)
      maxSub <- rbind(maxSub, iSub)
    }
  }
  ## the first row of each run survives; its row name is the start position in x
  deDup <- maxSub[!duplicated(maxSub),]
  seqStarts <- as.numeric(rownames(deDup))
  outList <- vector("list", nrow(deDup))
  for(i in 1:nrow(deDup)){
    outList[[i]] <- list(
      Sequence = x[seqStarts[i]:(seqStarts[i]+deDup[i,"pLen"])],
      Length=deDup[i,"pLen"]+1,
      StartPosition=seqStarts[i],
      EndPosition=seqStarts[i]+deDup[i,"pLen"])
  }
  ##
  return(outList)
}
##
There are definitely things that can be improved in this function. For instance, I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence. It just so happens that the true start positions of such sequences are given by the row names of one of the intermediate data.frames, so I used those as a hacky workaround. Anyway, the function accepts a numeric vector x and a minimum length parameter, minSize, and returns a list containing information about the sequences it finds. Note that it only looks at runs of adjacent elements with a constant difference, and for each difference it keeps only the longest run(s).
set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
[1] 16 17 18 19 20 21 22 23 24 25 26
[[1]]$Length
[1] 11
[[1]]$StartPosition
[1] 18
[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25
[[1]]$Length
[1] 5
[[1]]$StartPosition
[1] 12760
[[1]]$EndPosition
[1] 12764
[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19
[[2]]$Length
[1] 5
[[2]]$StartPosition
[1] 37988
[[2]]$EndPosition
[1] 37992
Like I said, it's sloppy and inelegant, but it should get you started.
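For comparison, the core idea of the function above (run-length encoding the differences) can be written much more compactly. This is a sketch rather than a drop-in replacement: contiguous_aps and min_size are names invented here, and unlike arithSeq it returns every sufficiently long run of adjacent elements, not just the longest run per difference.

```r
# Hypothetical compact variant: every run of adjacent elements of x
# with a constant difference and at least min_size elements.
contiguous_aps <- function(x, min_size = 3) {
  r <- rle(diff(x))
  ends <- cumsum(r$lengths) + 1    # position in x where each run ends
  starts <- ends - r$lengths       # position in x where each run starts
  keep <- r$lengths + 1 >= min_size
  Map(function(s, e) x[s:e], starts[keep], ends[keep])
}

nSeq <- c(1:10, 12, 33, 13:17, 16:26)
runs <- contiguous_aps(nSeq, min_size = 4)
```

With nSeq as defined above, runs holds three sequences: 1 to 10, 13 to 17, and 16 to 26.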
As an exercise, I was given two samples, u and v, drawn after setting a seed, and asked to show how many of the values that are in v but not in u fall into the bins [1,50] and [51,100]. I am then asked to add a line of code to confirm my answer using a relational operator (like >) and sum().
I solved the first part:
table(findInterval(setdiff(v,u),c(50)))
But for the second part, I don't really get what I need to do; any help is appreciated!
Example:
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),c(50)))
Output:
0 1
12 12
To use a relational operator and sum(), create a logical vector and sum it (each TRUE counts as 1):
i1 <- v[!v %in% u] > 50
sum(i1)
sum(!i1)
Note: If the OP intended to use only unique values (as setdiff does), then take the unique values first:
i1 <- unique(v[!v %in% u]) > 50
out1 <- sum(i1)
out2 <- sum(!i1)
Checking against the output of table:
tbl1 <- table(findInterval(setdiff(v,u),c(50)))
all.equal(as.numeric(tbl1), c(out1, out2), check.attributes = FALSE)
#[1] TRUE
Since you are cutting the intervals at a single number, you can verify your answer using > directly.
This is your code
set.seed(1201)
u = sample(100,100,replace=TRUE)
v = sample(100,100,replace=TRUE)
table(findInterval(setdiff(v,u),50))
#0 1
#9 9
Without findInterval
table(setdiff(v,u) > 50)
#FALSE TRUE
# 9 9
Suppose we have a vector:
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
Expected output:
v_index <- c(5,6,7)
v always starts and ends with 0. There can be only one cluster of zeros between two 1s, and I want the indices of that cluster.
Seems simple enough, but I can't get my head around it...
I think this will do
which(cumsum(v == 1L) == 1L)[-1L]
## [1] 5 6 7
The idea here is to separate all the instances of "one"s to groups and select the first group while removing the occurrence of the "one" at the beginning (because you only want the zeroes).
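Printing the intermediate steps makes the grouping easier to see. The variable names grp and in_first below are just illustrative.

```r
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
grp <- cumsum(v == 1L)        # running count of ones seen so far
grp
## [1] 0 0 0 1 1 1 1 2 3 4 4 4
in_first <- which(grp == 1L)  # from the first one up to just before the second
in_first
## [1] 4 5 6 7
v_index <- in_first[-1L]      # drop position 4, which holds the one itself
v_index
## [1] 5 6 7
```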
v <- c(0,0,0,1,0,0,0,1,1,1,0,0)
v_index<-seq(which(v!=0)[1]+1,which(v!=0)[2]-1,1)
> v_index
[1] 5 6 7
Explanation: I ask which indices are not equal to 0:
which(v!=0)
Then I take the first and second elements of that vector and create the sequence of indices that lies strictly between them.
This is probably one of the simplest answers out there. Find which items are equal to one, then produce a sequence using the first two indexes, incrementing the first and decrementing the other.
block <- which(v == 1)
start <- block[1] + 1
end <- block[2] - 1
v_index <- start:end
v_index
[1] 5 6 7
I have a column in a data.frame where each observation is a string of numbers (e.g. "1,5,6,7,0,21"). I am attempting to calculate the difference for the first instance of non-consecutive numbers. In the above example the result would be 5 - 1 = 4. However, with the code I currently have I get 6. If my input is "1,2,0,21" I get the correct result of 21 - 2 = 19 (the numbers are sorted before subtraction occurs). I thought maybe the zero was the issue, but adding one to all values did not solve the issue. Perhaps a problem with my indexing? Any suggestions?
# find distance between number in first gap of non-consecutive numbers
b <- c("1,5,6,7,0,21") # does not work as desired result is 6 instead of 4
# b <- ("1,2,0,21") # works as desired
b.Uncomma <- sort(unique(as.numeric(unlist(strsplit(b, split=","))))) # remove commas, remove duplicates, sort
#b.Uncomma <- b.Uncomma + 1 # same result
b.Gaps <- c(which(diff(b.Uncomma) != 1), length(b.Uncomma)) # find where the difference is not 1
b.FirstGap <- b.Gaps[1:2] # get the positions/index on either side of the first gap
b.Result <- b.Uncomma[(b.FirstGap[2])] - b.Uncomma[(b.FirstGap[1])] # subtract to get result
inp <- scan(text=b,sep=",")
#Read 6 items
sinp <- sort(inp)
diff(sinp)
#[1] 1 4 1 1 14
> diff(sinp)[diff(sinp) != 1][1]
#[1] 4
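Try the following. In your code, b.Gaps[1:2] holds the positions of the first two gaps, so you subtract across everything between them (7 - 1 = 6). What you want is the element just after the first gap minus the element at it: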
b.Uncomma <- sort(unique(as.numeric(unlist(strsplit(b, split=","))))) # remove commas, remove duplicates, sort
b.Gaps <- c(which(diff(b.Uncomma) != 1), length(b.Uncomma)) # find where the difference is not 1
b.FirstGap <- b.Gaps[1] # get the positions/index of the first gap
b.Result <- b.Uncomma[(b.FirstGap+1)] - b.Uncomma[(b.FirstGap)] # subtract to get result
b.Result
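Since the question mentions a data.frame column of such strings, the same logic could be wrapped in a small function and applied row by row. This is a sketch under assumptions: first_gap, df and the column names b and gap are invented for illustration, and every entry is assumed to be a comma-separated string of numbers.

```r
# Hypothetical helper: size of the first gap between non-consecutive numbers.
first_gap <- function(s) {
  x <- sort(unique(as.numeric(strsplit(s, ",")[[1]])))
  d <- diff(x)
  d[d != 1][1]   # NA when all numbers are consecutive
}

df <- data.frame(b = c("1,5,6,7,0,21", "1,2,0,21"), stringsAsFactors = FALSE)
df$gap <- vapply(df$b, first_gap, numeric(1), USE.NAMES = FALSE)
df$gap
## [1]  4 19
```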
I have a vector of dollar values like this (vec):
[1] 460.08 3220.56 1506.20 1363.76 1838.00 1838.00 3684.94 2352.66 1606.02
[10] 1840.05 518.98 1603.53 1556.94 347.32 253.16 12.95 1828.81 1896.32
[19] 4962.60 426.33 3237.04 1601.40 2004.57 183.80 1570.75 3622.96 230.04
[28] 426.33 3237.04 1601.40 2004.57 183.80
If I have a charge that resulted from some sum of these numbers, how could I find it? For example, if the charge was 6747.81, then it must have resulted from 1506.20 + 3237.04 + 2004.57 (the 3rd, 29th and 31st vector elements). How could I solve for these vector elements given the sum?
I would imagine finding all possible sums is the answer then matching it to the vector elements that led to it.
I have played with using combn(vec, 3) to find all combinations of 3, but this doesn't quite give what I want.
You'll want to use colSums (or apply) after combn to get the sums.
set.seed(100)
# Generate fake data
vec <- rpois(10, 20)
# Get all combinations of 3 elements
combs <- combn(vec, 3)
# Find the resulting sums
out <- colSums(combs)
# Making up a value to search for
val <- vec[2]+vec[6]+vec[8]
# Find which combinations lead to that value
id <- which(out == val)
# Pull out those combinations
combs[,id]
Some output to show the results for this example
> vec
[1] 17 12 23 20 21 17 21 18 22 22
> val
[1] 47
> combs[,id]
[,1] [,2]
[1,] 17 12
[2,] 12 17
[3,] 18 18
Edit: Just saw that there isn't necessarily a restriction to use 3 items. One could generalize this just by doing it for every possible sample size but I don't have time to do that right now. It would also be fairly slow for even moderately sized problems.
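A rough sketch of that generalization is to repeat the combn step for each subset size up to some cap. The function name find_subsets and the arguments max_k and tol are assumptions made for this example. It works on positions rather than values so that duplicated amounts stay distinguishable, and it compares with a tolerance because the dollar values are floating point.

```r
# Hypothetical helper: positions of all subsets of size 1..max_k
# whose elements sum to target (within tol).
find_subsets <- function(vec, target, max_k = 3, tol = 1e-8) {
  hits <- list()
  for (k in seq_len(min(max_k, length(vec)))) {
    idx <- combn(length(vec), k)                 # each column: one set of positions
    sums <- colSums(matrix(vec[idx], nrow = k))  # sum of the values at those positions
    for (col in which(abs(sums - target) < tol)) {
      hits[[length(hits) + 1]] <- idx[, col]
    }
  }
  hits
}

vec <- c(460.08, 3220.56, 1506.20, 1363.76, 1838.00, 1838.00, 3684.94, 2352.66,
         1606.02, 1840.05, 518.98, 1603.53, 1556.94, 347.32, 253.16, 12.95,
         1828.81, 1896.32, 4962.60, 426.33, 3237.04, 1601.40, 2004.57, 183.80,
         1570.75, 3622.96, 230.04, 426.33, 3237.04, 1601.40, 2004.57, 183.80)
hits <- find_subsets(vec, 6747.81, max_k = 3)
```

For the vector in the question, hits includes positions 3, 29 and 31, along with other position sets that pick the duplicated copies of the same amounts. The number of combinations grows very quickly with max_k, so this stays practical only for small subset sizes.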
I know this should be simple but I just can't do it...I have a data frame called data that works nicely and does what I want it to with the correct column headers and everything. I can call colSums() to get a list of 21 numbers which are the sums of each column.
> a <- colSums(data,na.rm = TRUE)
> names(a) <- NULL
> a
[1] 1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80
[14] 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
The problem is I need a list with the first number alone, the sum of the next two, sum of the next 3, sum of the next 4 etc. until I run out of numbers. I imagine it would look something like this:
c(sum(a[1]),sum(a[2:3]),sum(a[4:6])... etc.
Any help or a different way to do this would be greatly appreciated!
Thank you.
You should only need to go out to something on the order of sqrt(length(vector)). The seq function lets you specify a start integer and a length, so sending a sequence of integers to seq(1+x*(x-1)/2, length=x) should create the right set of sequences. It wasn't clear whether incomplete sequences at the end should return a result or NA, so I put in na.rm=TRUE. You might decide otherwise. (You did not illustrate a dataframe but rather an ordinary numeric vector.)
sumsegs <- function(vec) sapply(1:sqrt(2*length(vec)), function(x)
sum( vec[seq(1+x*(x-1)/2, length=x)], na.rm=TRUE) )
a <- scan()
1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
# 22: enter carriage return to stop scan input
#Read 21 items
sumsegs(a)
#[1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3
I'm not exactly sure what the right upper limit is on the number to send to the inner function. sqrt(length(vec)) is too short, but sqrt(2*length(vec)) seems to be "working", at lower numbers anyway.
> sapply( sapply(1:sqrt(2*100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105
> sapply( sapply(1:sqrt(100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55
The following returns the last index of the sequences so formed. Using a factor of 2.1 rather than 2 corrects minor deficiencies for lengths in the range 500-1000:
tail(lapply( sapply(1:sqrt(2.1*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 528
tail(lapply( sapply(1:sqrt(2*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 496
Going higher did not seem to degrade the "times 2" correction. There's probably some kewl number theory explanation for this.
tail(lapply( sapply(1:sqrt(2*100000), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 100128
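One way to pin down the limit, offered as a sketch rather than as part of the answer above: the first n groups use n(n+1)/2 elements (the n-th triangular number), so the number of complete groups in a vector of length N is the largest n with n(n+1)/2 <= N, which is floor((sqrt(8N+1)-1)/2). The names n_full_groups and sum_segments are made up here, and this version drops an incomplete trailing group instead of summing it.

```r
# Hypothetical helpers based on triangular numbers.
n_full_groups <- function(len) floor((sqrt(8 * len + 1) - 1) / 2)

sum_segments <- function(vec) {
  n <- n_full_groups(length(vec))
  groups <- rep(seq_len(n), seq_len(n))   # 1, 2, 2, 3, 3, 3, ...
  # Only the elements covered by complete groups are summed
  as.vector(tapply(vec[seq_along(groups)], groups, sum))
}

n_full_groups(21)    # 6 complete groups, using all 21 elements
n_full_groups(500)   # 31 complete groups, using 496 elements
sum_segments(1:10)   # 1, 2+3, 4+5+6, 7+8+9+10
```

For large N, sqrt(2N) is close to this value, which is why the "times 2" factor behaves well, although it can overshoot by one group (for N = 100000 it gives 447 groups ending at index 100128, while only 446 groups are complete).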
Alternatively a much more naive method is:
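sums <- colSums(data, na.rm = TRUE)
n <- 0                 # the current group holds n + 1 elements
i <- 1                 # index where the current group starts
newVec <- vector()
while (i <= length(sums)) {
  intermediate <- 0
  for (j in i:(i + n)) {
    # guard against running past the end in an incomplete last group
    if (j <= length(sums))
      intermediate <- intermediate + sums[j]
  }
  newVec <- c(newVec, intermediate)
  i <- i + n + 1       # jump past the group just summed
  n <- n + 1           # the next group is one element longer
}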
Here's a similar approach, using rep(...) and by(...)
n <- (-1+sqrt(1+8*length(a)))/2 # number of groups
groups <- rep(1:n,1:n) # indexing vector
result <- as.vector(by(a,groups,sum))
result
# [1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3