Find all possible sums in a vector (R)

I have a vector of dollar values like this (vec):
[1] 460.08 3220.56 1506.20 1363.76 1838.00 1838.00 3684.94 2352.66 1606.02
[10] 1840.05 518.98 1603.53 1556.94 347.32 253.16 12.95 1828.81 1896.32
[19] 4962.60 426.33 3237.04 1601.40 2004.57 183.80 1570.75 3622.96 230.04
[28] 426.33 3237.04 1601.40 2004.57 183.80
If I have a charge that resulted from some sum of these numbers, how could I find it? For example, if the charge was 6747.81, then it must have resulted from 1506.20 + 3237.04 + 2004.57 (the 3rd, 29th and 31st vector elements). How could I solve for these vector elements given the sum?
I would imagine the answer is to find all possible sums and then match the charge to the vector elements that led to it.
I have played with using combn(vec, 3) to find all combinations of 3, but this doesn't quite give what I want.

You'll want to use colSums (or apply) after combn to get the sums.
set.seed(100)
# Generate fake data
vec <- rpois(10, 20)
# Get all combinations of 3 elements
combs <- combn(vec, 3)
# Find the resulting sums
out <- colSums(combs)
# Making up a value to search for
val <- vec[2]+vec[6]+vec[8]
# Find which combinations lead to that value
id <- which(out == val)
# Pull out those combinations
combs[,id]
Some output to show the results for this example
> vec
[1] 17 12 23 20 21 17 21 18 22 22
> val
[1] 47
> combs[,id]
     [,1] [,2]
[1,]   17   12
[2,]   12   17
[3,]   18   18
Edit: Just saw that there isn't necessarily a restriction to use 3 items. One could generalize this just by doing it for every possible sample size but I don't have time to do that right now. It would also be fairly slow for even moderately sized problems.
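The generalisation mentioned in the edit could be sketched roughly as follows: loop over every subset size and compare the sums with a small tolerance, since an exact == on dollar amounts stored as doubles can miss matches. The helper name find_subsets, its max_size and tol arguments, and the shortened example vector are assumptions made for illustration, not part of the original answer. The number of subsets grows as 2^n, so this is only practical for short vectors or a small max_size.

```r
# Hypothetical helper: return every index set of `vec` (up to `max_size`
# elements) whose values sum to `target`, within tolerance `tol`.
find_subsets <- function(vec, target, max_size = length(vec), tol = 1e-8) {
  hits <- list()
  for (k in seq_len(max_size)) {
    idx <- combn(seq_along(vec), k)              # each column is one index set
    sums <- colSums(matrix(vec[idx], nrow = k))  # sum of each index set
    for (j in which(abs(sums - target) < tol)) {
      hits[[length(hits) + 1]] <- idx[, j]
    }
  }
  hits
}

# Shortened, made-up subset of the question's values
vec <- c(460.08, 3220.56, 1506.20, 1363.76, 1838.00, 3237.04, 2004.57)
res <- find_subsets(vec, 6747.81, max_size = 3)
res
```

Working with indices rather than values makes it possible to report which positions produced the sum, even when the vector contains repeated amounts.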

Related

Remove All Columns where the last row is not equal to specific value x [duplicate]

I have a data frame (DF) that looks like this:
DF <- rbind (c(10,20,30,40,50), c(21,68,45,33,21), c(11,98,32,10,30), c(50,70,70,70,50))
10 20 30 40 50
21 68 45 33 21
11 98 32 10 30
50 70 70 70 50
In my scenario, x would be 50, so my resulting data frame (resultDF) would look like this:
10 50
21 21
11 30
50 50
How can I do this in R? I have attempted using subset as below, but it doesn't work as I expected:
resultDF <- subset(DF, DF[nrow(DF),] == 50)
Error in x[subset & !is.na(subset), vars, drop = drop] :
(subscript) logical subscript too long
I have solved it. My subsetting call was wrong. I used the following piece of code to get the results I needed.
resultDF <- DF[, DF[nrow(DF),] == 50]
Your issue with subset() was only about the syntax for calling it with a logical column vector (its third arg, not its second). You can either use subset() or plain logical indexing. The latter is recommended.
The help page ?subset tells you its optional second arg ('subset') is a logical row-vector, and its optional third arg ('select') is a logical column-vector:
subset: logical expression indicating elements or rows to keep:
missing values are taken as false.
select: expression, indicating columns to select from a data frame.
So you want to call it with this logical column-vector:
> DF[nrow(DF),] == 50
[1]  TRUE FALSE FALSE FALSE  TRUE
There are two syntactical ways to leave subset()'s second arg default and pass the third arg:
# Explicitly pass the third arg by name...
> subset(DF, select=(DF[nrow(DF),] == 50) )
# Leave the 2nd arg empty; it is then treated as missing and all rows are kept...
> subset(DF, , (DF[nrow(DF),] == 50) )
     [,1] [,2]
[1,]   10   50
[2,]   21   21
[3,]   11   30
[4,]   50   50
The second way is probably preferable as it looks like generic row,col-indexing, and also doesn't require you to know the third arg's name.
(As a mnemonic, in R and SQL terminology, understand that 'select' implicitly means 'column-indices', and 'filter'/'subset' implicitly means 'row-indices'. Or in data.table terminology they're called i-indices, j-indices respectively.)
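For comparison, here is a minimal sketch of the plain logical-indexing route, using the matrix from the question. The only addition is drop = FALSE, which keeps the result a matrix even when a single column matches. The variable names keep and one_col are made up for this example.

```r
DF <- rbind(c(10, 20, 30, 40, 50),
            c(21, 68, 45, 33, 21),
            c(11, 98, 32, 10, 30),
            c(50, 70, 70, 70, 50))

# Logical vector with one entry per column: TRUE where the last row equals 50
keep <- DF[nrow(DF), ] == 50

# drop = FALSE preserves the matrix shape regardless of how many columns match
resultDF <- DF[, keep, drop = FALSE]

# With a single matching column the result is still a 4 x 1 matrix
one_col <- DF[, DF[1, ] == 20, drop = FALSE]
```

Without drop = FALSE, the single-column case would collapse to a plain vector, which tends to break later code that expects two dimensions.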

Arithmetic Progression series in R

I am new to this forum. I guess something like this has been asked before but, I am not really sure if that is what I want.
I have a sequence like this,
1 2 3 4 5 8 9 10 12 14 15 17 18 19
What I wish to do is get all the numbers that form a series, i.e. the numbers belonging to a set should each differ from the previous element by a constant amount, and the set should contain at least 3 elements.
i.e., I can see that (1,2,3,4,5) forms one such series in which numbers appear after an interval of 1 and the total size of this set is 5 which satisfies the minimum threshold criteria.
(1,3,5) forms one such a pattern in which the numbers appear after an interval of 2.
(8,10,12,14) forms another such pattern with an interval of 2. So, as you can see, the interval of repetition can be anything.
Also, for a particular set, I want only the maximal one. I don't want (8,10,12) as the output (although it satisfies the minimum threshold of 3 and the constant difference), only the one of maximal length, i.e. (8,10,12,14).
Similarly, for (1,2,3,4,5), I don't want (1,2,3) or (2,3,4,5) as the output, only the maximal-length one, i.e. (1,2,3,4,5).
How can I do this in R?
Edit: That is, I want any set which forms a basic AP series with any difference; however, the series should contain at least 3 elements and it should be maximal.
Edit2: I have tried using rle and acf in R but that doesn't entirely solve my problem.
Edit3: When I did acf, it basically gave me the maximum peak difference that I could have used. However, I want all the possible differences. Also, rle is quite different: it gave me the longest continuous run of identical numbers, which does not occur in my case.
If you are looking for sequences of consecutive numbers, then cgwtools::seqle will find them for you in the same way rle finds a sequence of repeated values.
In the general case of basically any subset of your data which form such a sequence, such as the 8,10,12,14 case you cite, your criteria are so general as to be very difficult to satisfy. You'd have to start at each element of your series and do a forward-looking search for x[j] +1, x[j]+2, x[j]+3 ... ad infinitum. This suggests using some tree-based algorithms.
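The forward-looking search described above could be sketched as a brute-force loop over every (start, difference) pair. This is only one possible reading of the question: it assumes the values are distinct, ignores their order in the input, and uses %in% for lookups, which is reliable for integers but may need a tolerance for non-integer data. The function name maximal_aps and its min_len argument are hypothetical.

```r
# Hypothetical helper: all maximal arithmetic progressions among the
# values of x that have at least `min_len` elements.
maximal_aps <- function(x, min_len = 3) {
  x <- sort(unique(x))
  n <- length(x)
  out <- list()
  if (n < 2) return(out)
  for (i in seq_len(n - 1)) {
    for (j in (i + 1):n) {
      d <- x[j] - x[i]
      # If x[i] - d is present, this progression started earlier, so
      # the run beginning at x[i] would not be maximal: skip it.
      if ((x[i] - d) %in% x) next
      s <- x[i]
      # Extend forward while the next term is present
      while ((s[length(s)] + d) %in% x) s <- c(s, s[length(s)] + d)
      if (length(s) >= min_len) out[[length(out) + 1]] <- s
    }
  }
  out
}

x <- c(1, 2, 3, 4, 5, 8, 9, 10, 12, 14, 15, 17, 18, 19)
aps <- maximal_aps(x)
```

Every progression is started only from its true first term, so sub-runs such as (8,10,12) or (2,3,4,5) are never reported. The cost is roughly quadratic in the number of distinct values, which is fine for short vectors but slow for large ones.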
Here's a potential solution - albeit a very ugly, sloppy one:
arithSeq <- function(x = nSeq, minSize = 4) {
  # Run-length encode the first differences: each run of equal
  # differences is one arithmetic sequence of consecutive elements.
  Runs <- rle(diff(x))
  rLens <- Runs[[1]]
  rVals <- Runs[[2]]
  # Intended start/end indices of each run (known to be off; the row
  # names are used for the start positions further down instead).
  pStart <- c(
    rep(1, rLens[1]),
    rep(cumsum(1 + rLens[-length(rLens)]), times = rLens[-1])
  )
  pEnd <- pStart + c(
    rep(rLens[1] - 1, rLens[1]),
    rep(rLens[-1], times = rLens[-1])
  )
  pGrp <- rep(1:length(rLens), times = rLens)
  pLen <- rep(rLens, times = rLens)
  # One row per difference; row i describes the run containing diff i.
  dAll <- data.frame(
    pStart = pStart,
    pEnd = pEnd,
    pGrp = pGrp,
    pLen = pLen,
    runVal = rep(rVals, rLens)
  )
  # Keep runs long enough to give at least minSize elements.
  dSub <- subset(dAll, pLen >= minSize - 1)
  uVals <- unique(dSub$runVal)
  # For each distinct difference, keep only its longest run(s).
  maxSub <- subset(dSub, runVal == uVals[1])
  maxLen <- max(maxSub$pLen)
  maxSub <- subset(maxSub, pLen == maxLen)
  if (length(uVals) > 1) {
    for (i in 2:length(uVals)) {
      iSub <- subset(dSub, runVal == uVals[i])
      iMaxLen <- max(iSub$pLen)
      iSub <- subset(iSub, pLen == iMaxLen)
      maxSub <- rbind(maxSub, iSub)
    }
  }
  # The first row of each run survives; its row name is the start index.
  deDup <- maxSub[!duplicated(maxSub), ]
  seqStarts <- as.numeric(rownames(deDup))
  outList <- list(NULL); length(outList) <- nrow(deDup)
  for (i in seq_len(nrow(deDup))) {
    outList[[i]] <- list(
      Sequence = x[seqStarts[i]:(seqStarts[i] + deDup[i, "pLen"])],
      Length = deDup[i, "pLen"] + 1,
      StartPosition = seqStarts[i],
      EndPosition = seqStarts[i] + deDup[i, "pLen"])
  }
  return(outList)
}
There are things that can definitely be improved in this function. For instance, I made a mistake somewhere in the calculation of pStart and pEnd, the start and end indices of a given arithmetic sequence. It just so happens that the true start positions are given by the row names of one of the intermediate data.frames, so I used those instead, which is a hacky sort of solution. Anyway, the function accepts a numeric vector x and a minimum length parameter, minSize. It returns a list describing runs of consecutive elements with a constant difference: for each difference value, the longest run(s) with at least minSize elements.
set.seed(1234)
lSeq <- sample(1:25,100000,replace=TRUE)
nSeq <- c(1:10,12,33,13:17,16:26)
##
> arithSeq(nSeq)
[[1]]
[[1]]$Sequence
[1] 16 17 18 19 20 21 22 23 24 25 26
[[1]]$Length
[1] 11
[[1]]$StartPosition
[1] 18
[[1]]$EndPosition
[1] 28
##
> arithSeq(x=lSeq,minSize=5)
[[1]]
[[1]]$Sequence
[1] 13 16 19 22 25
[[1]]$Length
[1] 5
[[1]]$StartPosition
[1] 12760
[[1]]$EndPosition
[1] 12764
[[2]]
[[2]]$Sequence
[1] 11 13 15 17 19
[[2]]$Length
[1] 5
[[2]]$StartPosition
[1] 37988
[[2]]$EndPosition
[1] 37992
Like I said, it's sloppy and inelegant, but it should get you started.

How to sum specific vectors in a list in R

I know this should be simple but I just can't do it...I have a data frame called data that works nicely and does what I want it to with the correct column headers and everything. I can call colSums() to get a list of 21 numbers which are the sums of each column.
> a <- colSums(data,na.rm = TRUE)
> names(a) <- NULL
> a
[1] 1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80
[14] 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
The problem is I need a list with the first number alone, the sum of the next two, sum of the next 3, sum of the next 4 etc. until I run out of numbers. I imagine it would look something like this:
c(sum(a[1]),sum(a[2:3]),sum(a[4:6])... etc.
Any help or a different way to do this would be greatly appreciated!
Thank you.
You should only need to go out to something on the order of sqrt(length(vector)). The seq function lets you specify a start integer and a length, so sending a sequence of integers to seq(1+x*(x-1)/2, length=x) should create the right set of sequences. It wasn't clear whether incomplete sequences at the end should return a result or NA so I put in na.rm=TRUE. You might decide otherwise. (You did not illustrate a dataframe but rather an ordinary numeric vector.)
sumsegs <- function(vec) sapply(1:sqrt(2*length(vec)), function(x)
sum( vec[seq(1+x*(x-1)/2, length=x)], na.rm=TRUE) )
a <- scan()
1000000.00 680000.00 170000.00 462400.00 115600.00 144500.00 314432.00 78608.00 98260.00 122825.00 213813.76 53453.44 66816.80 83521.00 104401.25 145393.36 36348.34 45435.42 56794.28 70992.85 88741.06
# 22: enter carriage return to stop scan input
#Read 21 items
sumsegs(a)
#[1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3
I'm not exactly sure what the right upper limit is on the number to send to the inner function. sqrt(length(vec)) is too short, but sqrt(2*length(vec)) seems to be "working", at lower numbers anyway.
> sapply( sapply(1:sqrt(2*100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55 66 78 91 105
> sapply( sapply(1:sqrt(100), function(x) seq(1+x*(x-1)/2, length=x) ), max)
[1] 1 3 6 10 15 21 28 36 45 55
The following returns the last element of the sequences so formed. Making the factor 2.1 rather than 2 corrects minor deficiencies for lengths in the range 500-1000; compare the factor 2.1 (first call) with the factor 2 (second call):
tail(lapply( sapply(1:sqrt(2.1*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 528
tail(lapply( sapply(1:sqrt(2*500), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 496
Going higher did not seem to degrade the "times 2" correction. There's probably some kewl number theory explanation for this.
tail(lapply( sapply(1:sqrt(2*100000), function(x) seq(1+x*(x-1)/2, length=x) ), max),1 )
[[1]]
[1] 100128
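One possible explanation for why sqrt(2*length(vec)) works: the groups end at the triangular numbers k(k+1)/2, so the number of complete groups in n elements is the largest k with k(k+1)/2 <= n, which solves to floor((sqrt(8n+1)-1)/2), and sqrt(2n) is close to that. When a shorter final group should also be counted, the ceiling is needed instead, which is where the factor 2 can fall one short (n = 500 gives 31 rather than 32). The function names below are hypothetical; this is a sketch of the idea, not the answer's own code.

```r
# Number of complete groups of sizes 1, 2, 3, ... that fit in n elements
n_complete <- function(n) floor((sqrt(8 * n + 1) - 1) / 2)

# Number of groups when a shorter final group is also counted
n_groups <- function(n) ceiling((sqrt(8 * n + 1) - 1) / 2)

# Sum the 1st element, the next 2, the next 3, ... including a partial tail
sum_triangular <- function(vec) {
  k <- n_groups(length(vec))
  groups <- rep(seq_len(k), times = seq_len(k))[seq_along(vec)]
  as.vector(tapply(vec, groups, sum))
}

n_complete(21)        # 6
n_groups(500)         # 32
sum_triangular(1:7)   # 1 5 15 7
```

Truncating the grouping vector to the length of the input is what lets an incomplete last group through, so no na.rm handling is needed for the tail.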
Alternatively a much more naive method is:
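sums <- colSums(data)
n <- 0   # size of the current group, minus one
i <- 1   # index at which the current group starts
newVec <- vector()
while (i <= length(sums)) {
  intermediate <- 0
  for (j in i:(i + n)) {
    # Guard against running past the end in an incomplete last group
    if (j <= length(sums))
      intermediate <- intermediate + sums[j]
  }
  newVec <- c(newVec, intermediate)
  i <- i + n + 1   # the next group starts right after this one
  n <- n + 1       # and is one element longer
}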
Here's a similar approach, using rep(...) and by(...)
n <- (-1+sqrt(1+8*length(a)))/2 # number of groups
groups <- rep(1:n,1:n) # indexing vector
result <- as.vector(by(a,groups,sum))
result
# [1] 1000000.0 850000.0 722500.0 614125.0 522006.2 443705.3

average number of words in a character vector in R

I'm trying to get the average number of words in my character vector in R:
one <- c(9, 23, 43)
two <- c("this is a new york times article.", "short article.", "he went outside to smoke a cigarette.")
mydf <- data.frame(one, two)
mydf
# one two
# 1 9 this is a new york times article.
# 2 23 short article.
# 3 43 he went outside to smoke a cigarette.
I'm looking for a function that gives me the average number of words in the character vector "two".
The output here should be 5.3333 (= (7+2+7)/3).
Here's a possibility with the qdap package:
library(qdap)
wc(mydf$two, FALSE)/nrow(mydf)
## [1] 5.333333
This is overkill but you could also do:
word_stats(mydf$two)
## all n.sent n.words n.char n.syl n.poly wps cps sps psps cpw spw pspw n.state proDF2 n.hapax n.dis grow.rate prop.dis
## 1 all 3 16 68 23 3 5.333 22.667 7.667 1 4.250 1.438 .188 3 1 12 2 .750 .125
And wps column is words per sentence.
Or gregexpr()
mean(sapply(mydf$two,function(x)length(unlist(gregexpr(" ",x)))+1))
[1] 5.333333
Hadley Wickham's stringr package provides possibly the easiest way for this:
library(stringr)
foo<- str_split(two, " ") # split each element of your vector by the space sign
sapply(foo,length) # just a quick test: how many words has each element?
sum(sapply(foo,length))/length(foo) # calculate sum and divide it by the length of your original object
[1] 5.333333
I'm sure there are more elaborate methods available, but you can use strsplit to split your strings at whitespace into character vectors and count the number of elements in each.
mean(sapply(strsplit(as.character(mydf$two), "[[:space:]]+"), length))
# [1] 5.333333
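A slightly shorter base-R variant of the strsplit idea, assuming R 3.2.0 or later so that lengths() and trimws() are available. The variable name word_counts is made up for this sketch, and "word" here simply means a run of non-whitespace characters.

```r
two <- c("this is a new york times article.",
         "short article.",
         "he went outside to smoke a cigarette.")

# trimws() removes leading/trailing whitespace so it does not create an
# empty first token; lengths() counts the pieces of each split string.
word_counts <- lengths(strsplit(trimws(two), "[[:space:]]+"))

word_counts        # 7 2 7
mean(word_counts)
```

Keeping the per-string counts in their own vector makes it easy to check individual rows before averaging.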

what number is greater than x and falls after position y

Which number of x is > 5 and falls after the 10th position? It is the number in position 11.
But I find myself writing long code to get to the answer, and I am wondering if there is a quicker way.
x <- c(5,7,3,6,9,4,1,4,7,10,8,5,7,9,7,1, 8, 4, 4,9);
Identify the location of all numbers >5 call it x1:
x1 <- which(x>5);
Identify the first number of the locations(x1) that occurred after 10th position:
first(which(x1 >10))
this identifies location 6 of x1;
identify the location of that number in the original vector (x):
x1[first(which(x1 >10))];
now we have the position of the value we want in the original vector (x), and this code pulls the value we want:
x[x1[first(which(x1 >10))]]
This seems like very long code to answer a simple question. Do you know a shorter/simpler way to get the same result?
What's wrong with some very simple indexing and subsetting?
This gives the indices of all elements in x greater than 5 and which occur after position 10:
> which(x > 5 & seq_along(x) > 10)
[1] 11 13 14 15 17 20
So the ultimate Answer is
> which(x > 5 & seq_along(x) > 10)[1]
[1] 11
or
> head(which(x > 5 & seq_along(x) > 10), 1)
[1] 11
The trick used here is to generate a vector of indices of x using the seq_along() function. That way we can generate all the matches in a single logical statement, and then select the first of these.
If you want to extract the identified element then:
> want <- which(x > 5 & seq_along(x) > 10)[1]
> x[want]
[1] 8
[It wasn't clear whether you wanted to identify which element met your criteria or the value of that element.]
An alternative version using functional programming to extract the index:
> Find(function(y) x[y] > 5 & y > 10, seq_along(x))
[1] 11
or to extract the element:
> x[Find(function(y) x[y] > 5 & y > 10, seq_along(x))]
[1] 8
Don't know about difference in performance compared with simple indexing, though.
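If this lookup is needed repeatedly, the indexing approach could be wrapped in a small helper that restricts the comparison to the tail of the vector and returns NA when nothing matches. The function name first_after and its arguments are hypothetical, and no timing comparison has been made here.

```r
# Hypothetical helper: index of the first element of x that is greater
# than `threshold` and lies after position `pos`; NA if there is none.
first_after <- function(x, threshold, pos) {
  tail_idx <- seq_along(x)[seq_along(x) > pos]         # candidate positions
  hit <- tail_idx[which(x[tail_idx] > threshold)]      # which() drops NAs
  if (length(hit) > 0) hit[1] else NA_integer_
}

x <- c(5, 7, 3, 6, 9, 4, 1, 4, 7, 10, 8, 5, 7, 9, 7, 1, 8, 4, 4, 9)
want <- first_after(x, threshold = 5, pos = 10)
want      # 11
x[want]   # 8
```

Returning NA rather than an empty vector means x[want] yields NA instead of a zero-length result, which is usually easier to handle downstream.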
