Finding all combinations of four numbers that equal a sum in R [duplicate] - r

This question already has an answer here:
Find all combinations of numbers that sum to a target
(1 answer)
Closed 7 years ago.
Is there a more efficient way to determine all possible combinations of 4 numbers that sum to a particular value? I have used the following, but if I expand it to more than ten numbers per group or more than 4 groups it becomes inefficient:
Grid <- expand.grid(a=seq(0, 100, 10), b= seq(0,100,10), c= seq(0,100,10), d=seq(0,100,10))
Grid$total <- apply(Grid, 1, sum)
Grid[Grid$total==100,]
In my real application, the numbers will be percentages that sum to 1 and will be adjusted in intervals of no less than 5.

I'm sure there are many solutions; here is one with the partitions library:
library(partitions)
restrictedparts(10, 4, include.zero = FALSE)
# [1,] 7 6 5 4 5 4 3 4 3
# [2,] 1 2 3 4 2 3 3 2 3
# [3,] 1 1 1 1 2 2 3 2 2
# [4,] 1 1 1 1 1 1 1 2 2
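These are the partitions of 10 into exactly 4 positive integers (zeros excluded), one per column. Each partition is listed once, without regard to the order of its parts.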
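Because restrictedparts lists each partition only once, ordered assignments such as (70, 10, 10, 10) and (10, 70, 10, 10) collapse into a single column. If every ordering is needed, as in the expand.grid approach from the question, one option is to enumerate them recursively in base R. This is only a sketch: the function name compositions_of is made up for illustration, zeros are allowed, and the step of 10 is taken from the question's example.

```r
# Hypothetical helper: all ordered ways to write `total` as a sum of
# `k` non-negative integers, one combination per row.
compositions_of <- function(total, k) {
  if (k == 1) return(matrix(total, nrow = 1))
  out <- lapply(0:total, function(first) {
    # Fix the first value, then enumerate the remaining k - 1 values.
    rest <- compositions_of(total - first, k - 1)
    cbind(first, rest, deparse.level = 0)
  })
  do.call(rbind, out)
}

# Work in units of 10, then scale back up to 0..100.
res <- compositions_of(10, 4) * 10
nrow(res)                  # 286, i.e. choose(13, 3)
all(rowSums(res) == 100)   # TRUE
```

Unlike the grid approach, this never builds rows that fail the sum constraint, so the work grows with the number of valid combinations rather than with the size of the full grid.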

Related

Assigning vector elements a value associated with preceding matching value [duplicate]

This question already has answers here:
Calculating cumulative sum for each row
(6 answers)
Sum of previous rows in a column R
(1 answer)
Closed 3 years ago.
I have a vector of alternating TRUE and FALSE values:
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
I'd like to number each instance of TRUE with a unique sequential number and to assign each FALSE value the number associated with the TRUE value preceding it.
Therefore, my desired output using the example dat above (which has 4 TRUE values) is:
1 1 1 2 2 2 2 3 3 4 4 4 4 4
What I tried:
I've tried the following (which works), but I know there must be a simpler solution!!
whichT <- which(dat==T)
whichF <- which(dat==F)
l1 <- lapply(1:length(whichT),
             FUN = function(x)
               which(whichF > whichT[x] & whichF < whichT[(x+1)])
)
l1[[length(l1)]] <- which(whichF > whichT[length(whichT)])
replaceFs <- unlist(
  lapply(1:length(whichT),
         function(x) l1[[x]] <- rep(x,length(l1[[x]]))
  )
)
replaceTs <- 1:length(whichT)
dat2 <- dat
dat2[whichT] <- replaceTs
dat2[whichF] <- replaceFs
dat2
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
I need a simpler and quicker solution b/c my real data set is 181k rows long!
Base R solutions preferred, but any solution works
cumsum(dat) will do what you want. When used in mathematical functions TRUE gets converted to 1 and FALSE to 0 so taking the cumulative sum will add 1 every time you see a TRUE and add nothing when there is a FALSE which is what you want.
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
cumsum(dat)
# [1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
Instead of doing the indexing, this can easily be done with cumsum from base R. Here, TRUE/FALSE gets coerced to 1/0, and in the cumulative sum the running total is incremented by 1 wherever there is a 1.
cumsum(dat)
#[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
cumsum() is the most straightforward way, however, you can also do:
Reduce("+", dat, accumulate = TRUE)
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
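One edge case the answers above do not mention: if the vector starts with FALSE, those leading elements get 0, because no TRUE has been seen yet. A small sketch of this, plus a run on a vector of the size mentioned in the question (the names lead and big, and the random data, are made up for illustration):

```r
# Leading FALSE values have no preceding TRUE, so they are numbered 0.
lead <- c(FALSE, FALSE, TRUE, FALSE)
cumsum(lead)
# [1] 0 0 1 1

# Same idea on a vector as long as the real data set in the question.
set.seed(1)
big <- sample(c(TRUE, FALSE), 181000, replace = TRUE)
ids <- cumsum(big)
max(ids) == sum(big)   # the last group number equals the count of TRUEs
```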

apply conditional numbering to grouped data in R

I have a table like the one below with 100's of rows of data.
ID RANK
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3
4 6
4 7
4 7
4 7
4 7
4 7
4 6
I want to find a way to group the data by ID so that I can ReRank each group separately. The ReRank column is based on the Rank column: it renumbers the ranks within each group starting at 1, from least to greatest. It's important to note that the same number can appear in the ReRank column more than once, depending on the numbers in the Rank column.
In other words, the output needs to look like this
ID Rank ReRANK
1 3 2
1 2 1
1 3 2
2 4 1
2 8 2
3 3 1
3 3 1
3 3 1
For the life of me, I can't figure out how to ReRank the rows within each ID group based on the values in the Rank column.
This has been my best guess so far, but it definitely is not doing what I need it to do
ReRANK = mat.or.vec(length(RANK),1)
ReRANK[1] = counter = 1
for(i in 2:length(RANK)) {
if (RANK[i] != RANK[i-1]) { counter = counter + 1 }
ReRANK[i] = counter
}
Thank you in advance for the help!!
Here is a base R method using ave and rank:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) rank(i, ties.method="min"))
The "min" value of ties.method in rank ensures that tied values all receive the minimum rank. The default is to take the mean of the ranks.
If there are ties lower down in a group, rank counts those tied values, and the next distinct value gets a rank equal to the count of the lower values + 1. These ranks will still be ordered, and distinct values get distinct ranks. If you really want the count to be 1, 2, 3, and so on rather than 1, 3, 6 or whatever depending on the number of duplicate values, here is a little hack using factor:
df$ReRank <- ave(df$Rank, df$ID, FUN=function(i) {
  as.integer(factor(rank(i, ties.method="min")))
})
Here, we use factor to build levels that count upward from 1, one level per distinct rank. We then coerce the factor to an integer.
For example,
temp <- c(rep(1, 3), 2,5,1,4,3,7)
rank(temp)
[1] 2.5 2.5 2.5 5.0 8.0 2.5 7.0 6.0 9.0
rank(temp, ties.method="min")
[1] 1 1 1 5 8 1 7 6 9
as.integer(factor(rank(temp, ties.method="min")))
[1] 1 1 1 2 5 1 4 3 6
data
df <- read.table(header=T, text="ID Rank
1 2
1 3
1 3
2 4
2 8
3 3
3 3
3 3 ")

Identify repetitive pattern in numeric vector in R with fuzzy search [duplicate]

This question already has answers here:
Find and break on repeated runs
(3 answers)
Closed 6 years ago.
Imagine a vector of integers like so:
> rep(c(1,4,2),10)
[1] 1 4 2 1 4 2 1 4 2 1 4 2 1 4 2 1 4 2 1 4 2 1 4 2 1 4 2 1 4 2
For us human beings it seems easy to identify the pattern 1 - 4 - 2 even without knowing the function how the vector was created. But how would you identify this pattern using R?
Edit
As this question was marked as a dupe I'm going to specify it a bit. The above example was an easy one to explain the idea. The main goal would be to identify more hidden patterns like 1 4 2 5 6 7 1 4 2 9 1 4 2 3 4 5 1 4 2 and also patterns that are approximately the same like 1 4 2 1 4 1.99 1 4 2 1.01 4 2 1 4.01 2. What are the ideas to always Identify the pattern 1 4 2 in those cases?
Assuming that the subpattern must start at the beginning of the input and repeat to the end, try subpattern lengths k = 1, 2, 3, ... in turn. Only patterns that are at most half the length of the input are considered:
for(k in seq_len(length(x)/2)) {
  pat <- x[1:k]
  if (identical(rep(pat, length = length(x)), x)) {
    print(pat)
    break
  }
}
## [1] 1 4 2
Note: This was used as the input x:
x <- rep(c(1, 4, 2), 10)
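The loop above requires an exact match, so it does not cover the approximate case from the question's edit. One way to relax it, sketched here under assumptions, is to compare against the repeated candidate with a tolerance instead of identical. The function name find_pattern and the default tolerance of 0.05 are made up for illustration, and the sketch still assumes the pattern repeats without interruption from the first element, so it does not handle the "hidden" patterns with other values in between.

```r
# Hypothetical helper: shortest leading subpattern that reproduces x
# when repeated, allowing each element to deviate by at most `tol`.
find_pattern <- function(x, tol = 0.05) {
  for (k in seq_len(length(x) %/% 2)) {
    pat <- x[seq_len(k)]
    # rep_len() recycles the candidate to the full length of x.
    if (all(abs(rep_len(pat, length(x)) - x) <= tol)) {
      return(pat)
    }
  }
  NULL  # no repeating pattern found within the tolerance
}

x <- c(1, 4, 2, 1, 4, 1.99, 1, 4, 2, 1.01, 4, 2, 1, 4.01, 2)
find_pattern(x)
# [1] 1 4 2
```

The returned pattern is simply the first k elements of the input, so if those are themselves noisy the result will carry that noise.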

Maximum and mean lengths of streaks/runs of identical responses

We have a dataset with ID numbers in the first column and then responses to each of 240 questions in the following 240 columns. We'd like to assess the validity of the responses for each subject by finding the maximum and mean of the lengths of streaks or runs of identical responses. For example, if a subject responded (1, 1, 1, 2, 2, 5, 5, 5, 5, 1) to ten questions, the maximum would be 4 and the mean would be 2.5.
I have tried to solve this problem in R using rle(), but after I apply rle() to every row of the data frame I can't extract the lengths. Once I extract the lengths, I think it would be relatively easy to apply max() and mean(). Any help or advice on getting to that point would be appreciated.
There are two more issues that are minor and don't necessarily need to be answered here. The first is that it would be even more informative to find the maximum and mean per response (there are five possible responses, namely, 1 through 5). In the example above, the maxima and means for 1, 2, and 5 would be, respectively, 3 and 2, 2 and 2, and 4 and 4. The second is that I don't know how to apply rle() to the 240 responses exclusively, i.e. and not also to the ID number. I've been deleting the ID number column before manipulating the data frame in R, which is fine, but will lead to error if I unintentionally rearrange the rows.
Thank you!
The rle function returns a list, but this is not immediately obvious because objects of class rle have their own print method, so typing the name of the object shows a formatted summary rather than the raw list. To find out the structure of an object, you can use str, for example:
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)
str(codes)
You can get at the lengths by typing codes$lengths and similarly for the corresponding values.
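As a quick check against the numbers in the question, here is a short sketch that pulls the lengths and values out of the rle result for the ten-answer example and computes the overall and per-response summaries. It uses only base R, with tapply grouping the run lengths by the response value.

```r
x <- c(1, 1, 1, 2, 2, 5, 5, 5, 5, 1)
codes <- rle(x)

codes$lengths   # run lengths: 3 2 4 1
codes$values    # the response in each run: 1 2 5 1

max(codes$lengths)    # 4
mean(codes$lengths)   # 2.5

# Per-response summaries: group the run lengths by the run's value.
tapply(codes$lengths, codes$values, max)    # 1 -> 3, 2 -> 2, 5 -> 4
tapply(codes$lengths, codes$values, mean)   # 1 -> 2, 2 -> 2, 5 -> 4
```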
Anyway, notwithstanding the statistical issues, here is how to do what you want. Suppose you have 30 subjects and they have responded to eight questions. Your data might look like this
set.seed(123)
responses <- data.frame(matrix(sample(0:5, 8*30, replace=T), nc=8))
> head(responses)
X1 X2 X3 X4 X5 X6 X7 X8
1 3 2 4 2 4 1 1 5
2 1 5 2 1 5 3 1 1
3 1 3 1 2 3 5 5 3
4 4 4 5 3 4 2 4 2
5 5 5 2 5 3 1 2 4
6 3 3 3 3 1 1 3 2
You can extract the maximum lengths of the runs for each subject like this:
> max.lengths <- apply(responses, 1, function(x) max(rle(x)$lengths))
> max.lengths
[1] 2 2 2 2 2 4 3 1 1 2 2 1 2 3 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1
The max length was 2 for the first 5 subjects and 4 for the sixth subject, so it looks right.
Similarly for the mean lengths
> mean.lengths <- apply(responses, 1, function(x) mean(rle(x)$lengths))
> head(mean.lengths)
[1] 1.142857 1.142857 1.142857 1.142857 1.142857 2.000000
For example, the mean length for the first person was the mean of $1,1,1,1,1,2,1$ which is $8/7$, which agrees with what R says.
To break down the whole thing by response, you can use the same ideas and the tapply function like this:
bd <- function(x){
  means <- tapply(x$lengths, factor(x$values,levels=0:5), mean)
  means[is.na(means)] <- 0
  maxes <- tapply(x$lengths, factor(x$values,levels=0:5), max)
  maxes[is.na(maxes)] <- 0
  M <- rbind(means, maxes)
  rownames(M) <- c("mean", "max")
  M
}
lapply(apply(responses, 1, rle), bd)
This outputs another list. For example, if you scroll up, you will see that for subject 25, it says
[[25]]
0 1 2 3 4 5
mean 0 1 2 1 0 2
max 0 1 2 1 0 2
compare with
> responses[25,]
X1 X2 X3 X4 X5 X6 X7 X8
25 3 5 5 3 2 2 1 3
so it is giving the correct answer. You can give this list a name, for example
break.downs <- lapply(apply(responses, 1, rle), bd)
and then you can access the entry for subject i by typing
break.downs[[i]]
For the problem with the ID number column, if it's included, say as column 1, you can just do the whole analysis to responses[ ,-1] and that should be OK. The $-1$ just deletes the first column.
PS. Sorry, I just noticed that I did it with responses $0$ to $5$ instead of $1$ to $5$, but you just need to change levels=0:5 to levels=1:5 in the bd function and it should work just as well.
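To address the worry about losing track of which row belongs to which subject, one possible sketch keeps the ID column in the data frame, selects the answer columns by name, and attaches the ID to the results. The column names (id, q1 to q4), the two example subjects, and the name run_summary are made up for illustration.

```r
# Tiny made-up data set: an id column followed by four answers.
responses <- data.frame(
  id = c(101, 102),
  q1 = c(1, 3), q2 = c(1, 3), q3 = c(2, 3), q4 = c(2, 1)
)

# Select the answer columns by name so the id is never passed to rle().
answer_cols <- setdiff(names(responses), "id")
stats <- apply(as.matrix(responses[answer_cols]), 1, function(x) {
  len <- rle(unname(x))$lengths
  c(max_run = max(len), mean_run = mean(len))
})

# apply() returns one column per subject, so transpose before binding.
run_summary <- data.frame(id = responses$id, t(stats))
run_summary
```

Because the ID travels with the results, reordering the rows of responses beforehand does not mix up the subjects.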
I am partial to the data.table package. To use it, first reshape to long format. Then use rle (making sure to take the first list element of the result, using [[1]]), take the max/mean, and group by the respondent ID.
Here is an example with five respondents and 10 questions:
library(data.table)
set.seed(8028)
responses <- data.frame(cbind(id=1:5,matrix(sample(1:5, 10*5, replace=T), nc=10)))
responses
# id V2 V3 V4 V5 V6 V7 V8 V9 V10 V11
# 1 1 3 4 2 5 1 2 4 4 1 3
# 2 2 2 2 4 5 5 2 3 3 3 1
# 3 3 5 1 3 3 4 4 1 4 2 2
# 4 4 3 2 4 5 2 2 1 4 1 3
# 5 5 5 2 4 5 3 1 4 1 2 4
responses.long<-data.table(reshape(responses, idvar="id", varying=list(2:11), direction="long"),key=c("id","time"))
responses.long[,list(run=max(rle(V2)[[1]]), mean=mean(rle(V2)[[1]])), by="id"]
# id run mean
# 1: 1 2 1.111111
# 2: 2 3 1.666667
# 3: 3 2 1.428571
# 4: 4 2 1.111111
# 5: 5 1 1.000000
Wouldn't this question be more appropriate for StackOverflow?

Data simulation according to specific rules in R

I need help simulating a dataset.
It is supposed to simulate all possible outcomes on a signal detection theory task (participants are presented with trials and have to decide whether or not they detected given signal). Now, I need a dataset of all possible values for varying number of trials.
Say there are 10 trials, 5 with the signal present and 5 with the signal absent. I am only interested in correct detections (hits) and false alarms (Type I errors). A participant can make between 1 and 5 correct detections (I don't need 0's) and the same range of false alarms. With all possible combinations, that would be a dataset containing two variables with 5^2 = 25 cases. To make things more complicated, even the number of trials is variable. The number of both signal and non-signal trials can vary between 1 and 20, but the total number of trials cannot be less than 3 (either 1 S trial and 2 Non-S trials, or the other way around). And for each possible combination of trials, there is a group of possible combinations of hits and false alarms.
What I need is a dataset with 5 variables (total N, N of S trials, N of Non-S trials, N of Hits, and N of False Alarms) with all the possible values.
EXAMPLE
Here are all possible data for a total N of 4. Note that Signal + Noise = N_total, that N_Hit takes the values 1:Signal, and that N_FA takes the values 1:Noise.
N_total Signal Noise N_Hit N_FA
4 1 3 1 1
4 1 3 1 2
4 1 3 1 3
4 2 2 1 1
4 2 2 1 2
4 2 2 2 1
4 2 2 2 2
4 3 1 1 1
4 3 1 2 1
4 3 1 3 1
I'm an R novice so any help at all would be much appreciated!
Hope the description is clear.
I created a function, which uses the number of trials as parameter.
myfunc <- function(n) {
  # create a data frame of all combinations
  grid <- expand.grid(rep(list(seq_len(n - 1)), 4))
  # remove invalid combinations (keep valid ones)
  grid <- grid[grid[3] <= grid[1] &         # number of hits <= number of signals
               grid[4] <= grid[2] &         # false alarms <= noise
               (grid[1] + grid[2]) == n , ] # signal and noise sum to total n
  # remove signal and noise > 20
  grid <- grid[!rowSums(grid[1:2] > 20), ]
  # sort rows
  grid <- grid[order(grid[1], grid[3], grid[4]), ]
  # add total number of trials
  res <- cbind(n, grid)
  # remove row names, add column names and return the object
  return(setNames("rownames<-"(res, NULL),
                  c("N_total", "Signal", "Noise", "N_Hit", "N_FA")))
}
Use the function:
> myfunc(4)
N_total Signal Noise N_Hit N_FA
1 4 1 3 1 1
2 4 1 3 1 2
3 4 1 3 1 3
4 4 2 2 1 1
5 4 2 2 1 2
6 4 2 2 2 1
7 4 2 2 2 2
8 4 3 1 1 1
9 4 3 1 2 1
10 4 3 1 3 1
How to apply this function to the values 3-40:
lapply(3:40, myfunc)
This will return a list of data frames.
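For larger n, expand.grid over four columns creates (n - 1)^4 rows before filtering, which is about 2.3 million rows for n = 40. A possible alternative, sketched here under the same rules, builds only the valid rows by looping over the Signal/Noise splits directly. The function name sdt_grid and the max_trials argument are made up for illustration, not part of the answer above.

```r
# Hypothetical alternative: build only the valid rows for a total of n trials.
sdt_grid <- function(n, max_trials = 20) {
  signal <- seq_len(n - 1)
  noise  <- n - signal
  # Keep only splits where neither trial type exceeds the limit.
  keep <- signal <= max_trials & noise <= max_trials
  rows <- Map(function(s, z) {
    # N_FA varies fastest, so rows come out sorted by N_Hit, then N_FA.
    g <- expand.grid(N_FA = seq_len(z), N_Hit = seq_len(s))
    data.frame(N_total = n, Signal = s, Noise = z,
               N_Hit = g$N_Hit, N_FA = g$N_FA)
  }, signal[keep], noise[keep])
  do.call(rbind, rows)   # NULL if no split satisfies the limit
}

sdt_grid(4)                        # the 10 rows from the example above
all_n <- lapply(3:40, sdt_grid)    # one data frame per total N
```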

Resources