seq_along() - truncating a replication in R

I would like to generate the month number to go along with a list of values. The problem is that the list is not a full 2 replications of 12 months. It is 12 from the first year and 10 from the second year.
tibble(value=rnorm(22))
Some things I have tried are rep(1:12,2), thinking that the sequence would stop
when it hit the end of the length of the dataframe. I also tried seq_along(along.with=value,1:12) with the same line of thinking.

You want the length.out argument to rep():
rep(1:12, length.out = 22)
which gives
> rep(1:12, length.out = 22)
[1] 1 2 3 4 5 6 7 8 9 10 11 12 1 2 3 4 5 6 7 8 9 10
We get this because, from ?rep:
‘length.out’ may be given in place of ‘times’, in which case ‘x’
is repeated as many times as is necessary to create a vector of
this length. If both are given, ‘length.out’ takes priority and
‘times’ is ignored.
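As a minimal sketch of how this could be attached to a data frame of any length, the length can be taken from the data rather than hard-coded. The names `dat` and `month_alt` are made up for illustration; `rep_len()` is the base R shorthand for the same operation.

```r
set.seed(1)                                      # only so the example is repeatable
dat <- data.frame(value = rnorm(22))

# Recycle 1:12 until it is exactly as long as the data
dat$month <- rep(1:12, length.out = nrow(dat))

# Equivalent base R shorthand
dat$month_alt <- rep_len(1:12, nrow(dat))

tail(dat$month, 10)
```

Both columns hold the same values, so either form should do.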

I would roll out 22 months and then use a modulo operator to get the months in subsequent year(s):
library(dplyr)
tibble(value=rnorm(22)) %>%
mutate(month=1:22,
month=ifelse(month%%12==0, 12, month%%12))
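The same modulo idea can be written without `ifelse()` by shifting to a zero-based index first. This is a base R sketch; `n_obs` is an assumed name for the number of rows.

```r
n_obs <- 22

# Shift to 0-based, wrap at 12, shift back to 1-based
month <- (seq_len(n_obs) - 1) %% 12 + 1

month
```

Row 12 maps to 12 rather than 0 because the wrap happens on the shifted index.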

Related

Search for value within a range of values in two separate vectors

This is my first time posting to Stack Exchange, my apologies as I'm certain I will make a few mistakes. I am trying to assess false detections in a dataset.
I have one data frame with "true" detections
truth=
ID Start Stop SNR
1 213466 213468 10.08
2 32238 32240 10.28
3 218934 218936 12.02
4 222774 222776 11.4
5 68137 68139 10.99
And another data frame with a list of times, that represent possible 'real' detections
possible=
ID Times
1 32239.76
2 32241.14
3 68138.72
4 111233.93
5 128395.28
6 146180.31
7 188433.35
8 198714.7
I am trying to see if the values in my 'possible' data frame lie between the start and stop values. If so, I'd like to create a third column in "possible" called "between" and a column in the "truth" data frame called "match". For every value from "possible" that falls between, I'd like a 1, otherwise a 0. For all of the rows in "truth" that find a match, I'd like a 1, otherwise a 0.
Neither ID nor SNR is important. I'm not looking to match on ID. Instead I want to run through the data frame entirely. Output should look something like:
ID Times Between
1 32239.76 0
2 32241.14 1
3 68138.72 0
4 111233.93 0
5 128395.28 0
6 146180.31 1
7 188433.35 0
8 198714.7 0
Alternatively, knowing if any of my 'possible' time values fall within 2 seconds of start or end times would also do the trick (also with 1/0 outputs)
(Thanks for the feedback on the original post)
Thanks in advance for your patience with me as I navigate this system.
I think this can be conceptualised as a rolling join in data.table. Take this simplified example:
truth
# id start stop
#1: 1 1 5
#2: 2 7 10
#3: 3 12 15
#4: 4 17 20
#5: 5 22 26
possible
# id times
#1: 1 3
#2: 2 11
#3: 3 13
#4: 4 28
library(data.table)
setDT(truth)
setDT(possible)
melt(truth, measure.vars=c("start","stop"), value.name="times")[
possible, on="times", roll=TRUE
][, .(id=i.id, truthid=id, times, status=factor(variable, labels=c("in","out")))]
# id truthid times status
#1: 1 1 3 in
#2: 2 2 11 out
#3: 3 3 13 in
#4: 4 5 28 out
The source datasets were:
truth <- read.table(text="id start stop
1 1 5
2 7 10
3 12 15
4 17 20
5 22 26", header=TRUE)
possible <- read.table(text="id times
1 3
2 11
3 13
4 28", header=TRUE)
I'll post a solution that I'm pretty sure works like you want it to in order to get you started. Maybe someone else can post a more efficient answer.
Anyway, first I needed to generate some example data - next time please provide this from your own data set in your post using the function dput(head(truth, n = 25)) and dput(head(possible, n = 25)). I used:
#generate random test data
set.seed(7)
truth <- data.frame(c(1:100),
c(sample(5:20, size = 100, replace = T)),
c(sample(21:50, size = 100, replace = T)))
possible <- data.frame(c(sample(1:15, size = 15, replace = F)))
colnames(possible) <- "Times"
With sample data to work with, the following solution provides what I believe you are asking for. It should scale directly to your own dataset as it seems to be laid out. Respond below if the comments are unclear.
#need the %between% operator
library(data.table)
#initialize vectors - 0 or false by default
truth.match <- c(rep(0, times = nrow(truth)))
possible.between <- c(rep(0, times = nrow(possible)))
#iterate through 'possible' dataframe
for (i in 1:nrow(possible)){
#get boolean vector to show if any of the 'truth' rows are a 'match'
match.vec <- apply(truth[, 2:3],
MARGIN = 1,
FUN = function(x) {possible$Times[i] %between% x})
#if any are true then update the match and between vectors
if(any(match.vec)){
truth.match[match.vec] <- 1
possible.between[i] <- 1
}
}
#i think this should be called anyMatch for clarity
truth$anyMatch <- truth.match
#similarly; betweenAny
possible$betweenAny <- possible.between
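For comparison, here is a loop-free base R sketch of the same check, using the sample values from the question. It assumes the interval bounds are inclusive; the column names `Between` and `Match` follow the question's wording, and `inside` is a made-up helper name.

```r
truth <- data.frame(
  ID    = 1:5,
  Start = c(213466, 32238, 218934, 222774, 68137),
  Stop  = c(213468, 32240, 218936, 222776, 68139)
)
possible <- data.frame(
  ID    = 1:8,
  Times = c(32239.76, 32241.14, 68138.72, 111233.93,
            128395.28, 146180.31, 188433.35, 198714.7)
)

# Logical matrix: rows are 'possible' times, columns are 'truth' intervals
inside <- outer(possible$Times, truth$Start, ">=") &
          outer(possible$Times, truth$Stop,  "<=")

possible$Between <- as.integer(rowSums(inside) > 0)  # time falls in any interval
truth$Match      <- as.integer(colSums(inside) > 0)  # interval contains any time

possible
truth
```

On these values the times 32239.76 and 68138.72 fall inside an interval, so truth rows 2 and 5 get a match. The matrix has one cell per time/interval pair, so for very large inputs the data.table join above is likely the better fit.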

Searching the closest value in other column

Suppose we have a data frame of two columns
X Y
10 14
12 16
14 17
15 19
21 19
The first element of Y is 14; the nearest (or equal) value in X is 14, which is the 3rd element of X. Similarly, the next element of Y, 16, is closest to 15, which is the 4th element of X.
So, the output I would like should be
3
4
4
5
5
As my data is large, can you give me some advice on a systematic/proper way to code this?
You can try this piece of code:
apply(abs(outer(d$X,d$Y,FUN = '-')),2,which.min)
# [1] 3 4 4 5 5
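Here, abs(outer(d$X,d$Y,FUN = '-')) returns a matrix of absolute differences with one row per element of d$X and one column per element of d$Y, and apply(...,2,which.min) returns, for each column (that is, for each element of d$Y), the position of the minimum, which is the index of the nearest value in d$X.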
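A self-contained sketch with the data from the question makes the matrix orientation visible. The second variant is an assumption about what may suit large data better: it trades the full X-by-Y matrix for one pass over X per element of Y.

```r
d <- data.frame(X = c(10, 12, 14, 15, 21),
                Y = c(14, 16, 17, 19, 19))

# Rows correspond to X, columns correspond to Y
diffs <- abs(outer(d$X, d$Y, FUN = "-"))
dim(diffs)

# For each column (each Y), the row index of the smallest difference
nearest <- apply(diffs, 2, which.min)

# Same result without materialising the whole matrix
nearest_loop <- sapply(d$Y, function(y) which.min(abs(d$X - y)))

nearest
```

Note that `which.min()` returns the first position in case of ties.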

Efficient method of obtaining successive high values of data.frame column

Let's say I have the following data.frame in R:
df <- data.frame(order=(1:10),value=c(1,7,3,5,9,2,9,10,2,3))
Other than looping through the data and testing whether value exceeds the previous high value, how can I get successive high values so that I end up with a table like this?
order value
1 1
2 7
5 9
8 10
TIA
Here's one option, if I understood the question correctly:
df[df$value > cummax(c(-Inf, head(df$value, -1))),]
# order value
#1 1 1
#2 2 7
#5 5 9
#8 8 10
I use cummax to keep track of the maximum of column "value" and compare it (the previous row's cummax) to each "value" entry. To make sure the first entry is also selected, I start by "-Inf".
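Splitting the one-liner into steps may make the comparison easier to follow. The intermediate names `prev_max` and `is_new_high` are made up for this sketch.

```r
df <- data.frame(order = 1:10,
                 value = c(1, 7, 3, 5, 9, 2, 9, 10, 2, 3))

# Running maximum of everything *before* each row; -Inf for the first row
prev_max <- cummax(c(-Inf, head(df$value, -1)))

# A row is a new high when it strictly exceeds everything seen so far
is_new_high <- df$value > prev_max

cbind(df, prev_max, is_new_high)
df[is_new_high, ]
```

Because the comparison is strict, the second 9 (row 7) is not reported as a new high.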
"get successive high values (of value?)" is unclear.
It seems you want to filter only rows whose value is higher than previous max.
First, we reorder your df in increasing order of value... (not clear but I think that's what you wanted)
Then we use logical indexing with diff()>0 to only include strictly-increasing rows:
rdf <- df[order(df$value),]
rdf[ diff(rdf$value)>0, ]
order value
1 1 1
9 9 2
10 10 3
4 4 5
2 2 7
7 7 9
8 8 10

Filter between threshold

I am working with a large dataset and I am trying to first identify clusters of values that meet specific threshold values. My aim then is to only keep clusters of a minimum length. Below is some example data and my progress thus far:
Test = c("A","A","A","A","A","A","A","A","A","A","B","B","B","B","B","B","B","B","B","B")
Sequence = c(1,2,3,4,5,6,7,8,9,10,1,2,3,4,5,6,7,8,9,10)
Value = c(3,2,3,4,3,4,4,5,5,2,2,4,5,6,4,4,6,2,3,2)
Data <- data.frame(Test, Sequence, Value)
Using package evd, I have identified clusters of values >3
C1 <- clusters(Data$Value, u = 3, r = 1, cmax = F, plot = T)
Which produces
C1
$cluster1
4
4
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
My problem is twofold:
1) I don't know how to relate this back to the original dataframe (for example to Test A & B)
2) How can I only keep clusters with a minimum size of 3 (thus excluding Cluster 1)
I have looked into various filtering options etc. however they do not cluster data according to a desired threshold, with no options for the minimum size of the cluster either.
Any help is much appreciated.
Q1: relate back to original dataframe: Have a look at Carl Witthoft's answer to the question "detect intervals of the consequent integer sequences". He wrote seqle(), a variant of rle() that looks for integer sequences rather than repetitions.
Q2: only keep clusters of certain length:
C1[sapply(C1, length) >= 3]
yields the 2 clusters that are long enough:
$cluster2
6 7 8 9
4 4 5 5
$cluster3
12 13 14 15 16 17
4 5 6 4 4 6
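As a rough base R alternative that stays attached to the original data frame, runs above the threshold can be found with `rle()`. This is a sketch and not the evd approach: it treats a cluster as an unbroken run of values above 3 (which matches `r = 1` on this data), and it ignores `Test` when forming runs, so a run could in principle span the A/B boundary. The helper names are made up.

```r
Data <- data.frame(
  Test     = rep(c("A", "B"), each = 10),
  Sequence = rep(1:10, times = 2),
  Value    = c(3,2,3,4,3,4,4,5,5,2,2,4,5,6,4,4,6,2,3,2)
)

above   <- Data$Value > 3                  # threshold exceedance per row
runs    <- rle(above)                      # consecutive TRUE/FALSE runs
run_len <- rep(runs$lengths, runs$lengths) # length of the run each row belongs to

# Keep rows that are above the threshold AND sit in a run of at least 3
keep <- above & run_len >= 3

Data[keep, ]
```

The retained rows are 6 to 9 (Test A) and 12 to 17 (Test B), the same positions as cluster2 and cluster3 above.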

Combination with a minimum number of elements in a fixed length subset

I have been searching for long but unable to find a solution for this.
My question is: "Suppose you have n street lights (which cannot be moved), and any m of them that you pick must include at least k working ones. In how many ways can this be done?"
This seems to be a combination problem, but the problem here is "m" must be sequential.
Eg:
1 2 3 4 5 6 7 (Street lamps)
Let m=3
Then the valid sets are,
1 2 3
2 3 4
3 4 5
4 5 6
5 6 7
Whereas 1 2 4 and so on are invalid selections.
So every set must have at least 2 working lights. I have figured out how to find the minimum number of lamps required to satisfy the condition, but how can I find the number of ways in which it can be done?
There should certainly be some formula to do this, but I am unable to find it.. :(
Should always be (n-m)+1.
E.g., 10 lights (n = 10), 5 in set (m = 5):
1 2 3 4 5
2 3 4 5 6
3 4 5 6 7
4 5 6 7 8
5 6 7 8 9
6 7 8 9 10
Gives (10-5)+1 = 6 sets.
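The sets of consecutive lights can also be listed programmatically as a quick check of the (n-m)+1 count. This is a small base R sketch; `starts` and `windows` are illustrative names.

```r
n <- 10  # total number of lights
m <- 5   # lights per consecutive set

# Each set is identified by its first light
starts  <- seq_len(n - m + 1)
windows <- t(sapply(starts, function(s) s:(s + m - 1)))

windows
nrow(windows)
```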
The answer should always be m choose k for all values of n where n > m > k. I'll try to explain why;
Given, for example, the values n = 10, m = 4, k = 2, you can start by generating all possible permutations of 1s and 0s for sets of 4 lights, with exactly 2 lights on:
1100
0110
0011
1001
0101
1010
As you can see, there are 6 permutations, because 4 choose 2 = 6. You can choose any of these 6 permutations to be the first 4 lights. You then continue the sequence until you get n (in this case 10) lights, ensuring that you only ever add a zero if you must in order to keep the condition true of having 2 lights on for every 4. What you will find is that the sequence simply repeats; for example:
1100 -> next can be 1, so 11001
Next can still be 1 and meet the condition, so 110011.
The next must now be a zero, giving 1100110, and then again -> 11001100. This simply continues until the length is n : 1100110011. Given that the starting four can only be one of the above set, you will only get 6 different permutations.
Now, since the sequence will repeat exactly the same for any value of n, it means that the answer will always be m choose k.
For your example in your comment of 6,3,2, I can only find the following permutations:
011011
110110
101101
Which works, because 3 choose 2 = 3. If you can find more, then I guess I'm wrong and I've probably misunderstood again :D but from my understanding of this problem, I'm certain that the answer will always be m choose k.
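A brute-force count can be used to check this reasoning on small cases. The sketch below assumes the reading used in the enumeration above, namely that every window of m consecutive lights has exactly k lights on; the function name `count_exact` is made up. Under an "at least k" reading more patterns would qualify (all lights on, for example), so the count would be larger.

```r
count_exact <- function(n, m, k) {
  # All 2^n on/off patterns, one per row
  patterns <- as.matrix(expand.grid(rep(list(0:1), n)))
  ok <- apply(patterns, 1, function(p) {
    # Number of lights on in each window of m consecutive lights
    window_sums <- sapply(seq_len(n - m + 1),
                          function(i) sum(p[i:(i + m - 1)]))
    all(window_sums == k)
  })
  sum(ok)
}

count_exact(6, 3, 2)    # compare with choose(3, 2)
count_exact(10, 4, 2)   # compare with choose(4, 2)
```

The agreement follows from the repetition argument: if two neighbouring windows both contain exactly k lit lights, the light that leaves and the light that enters must be in the same state, so the pattern repeats with period m and is fixed by its first m lights.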
