R: Produce Index Values to Group Increasing Values in Vector

I have a list of increasing year values that occasionally has breaks in it, and I want to create a grouping value for each unbroken sequence. Think of a vector like this one (missing 2005, 2011, and 2012):
x <- c(2001,2002,2003,2004,2006,2007,2008,2009,2010,2013,2014,2015,2016)
I would like to produce an equal length vector that numbers every value in a run with the same index to end up with something like this.
[1] 1 1 1 1 2 2 2 2 2 3 3 3 3
I would like to do this using best R practices so I am trying to avoid falling back to a for loop but I am not sure how to get from Vector A to Vector B. Does anyone have any suggestions?
Some things I know I can do:
I can flag the record before or after a gap as true with an ifelse
I can get the index of when the counter should change by wrapping that in a which statement
This is the code for each (lag() here is dplyr::lag(); stats::lag() does not shift a plain vector):
ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE)
which(ifelse(!is.na(lag(x)) & x == lag(x)+1, FALSE, TRUE))

I think there are a couple of solutions to this problem. One, which d.b posted in the comments above, produces a sequence that steps up when there is a break in the sequence.
cummax(c(1, diff(x)))
Note that this reproduces the desired output only because the gaps in the example happen to be of size 2 and then 3; cumsum(c(1, diff(x) != 1)) increments at every break regardless of gap size.
There is a similar solution, which I chose to use, with ifelse() flagging breaks and cumsum(). I chose it because additional information, such as other vectors, can be included in the decision, and diff() seems to have problems with very erratic up-and-down values.
cumsum(ifelse(!is.na(lag(x)) & x == lag(x) + 1, FALSE, TRUE))
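For reference, the same grouping can be sketched in base R alone, without lag(). This is a minimal sketch, assuming that a "break" means any step other than +1; the name group_runs is made up for illustration.

```r
# Hypothetical helper: label each unbroken run of consecutive values.
group_runs <- function(v) {
  if (length(v) == 0) return(integer(0))
  # TRUE marks the start of a new run; the first element always starts one.
  starts <- c(TRUE, diff(v) != 1)
  # Each TRUE bumps the running count, giving one index per run.
  cumsum(starts)
}

x <- c(2001,2002,2003,2004,2006,2007,2008,2009,2010,2013,2014,2015,2016)
group_runs(x)
# [1] 1 1 1 1 2 2 2 2 2 3 3 3 3
```

Because the flag is an ordinary logical vector, further conditions (for example, a change in another vector of the same length) can be combined into starts with | before the cumsum().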

Related

Multiplying Elements of a Vector one more each time

I am trying to create a vector from another vector where I multiply the numbers in the vector one more each time.
For example if I had (1,2,3) the new vector would be (1, 1 x 2, 1 x 2 x 3)=(1,2,6)
I tried to create a loop for this as seen below. It seems to work for whole numbers but not decimals. I am not sure why.
x <- c(0.99,0.98,0.97,0.96,0.95)
for(i in 1:5){x[i]=prod(x[1:i])}
The result given is 0.9900000 0.9702000 0.9316831 0.8590845 0.7303385
which is incorrect: prod(x) on the original vector is 0.8582777, which is not the same as the last element of the result.
Does anyone know why this is the case? Or have a suggestion for improvement in my code to get the correct answer.
test<-c(1,2,3)
cumprod(test)
[1] 1 2 6
As @akrun suggests, one can achieve the same with:
Reduce("*", test, accumulate = TRUE)
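The loop in the question goes wrong because it overwrites x in place: by the time i = 3, x[2] already holds 0.99 * 0.98, so prod(x[1:3]) multiplies 0.99 in twice. A minimal sketch of the fix, writing the running products to a separate vector (the name out is only for illustration):

```r
x <- c(0.99, 0.98, 0.97, 0.96, 0.95)

# Write the running products to a new vector so x is never modified.
out <- numeric(length(x))
for (i in seq_along(x)) {
  out[i] <- prod(x[1:i])
}

# The loop now agrees with cumprod(), and the last element equals prod(x).
all.equal(out, cumprod(x))
# [1] TRUE
all.equal(out[length(x)], prod(x))
# [1] TRUE
```

With whole numbers such as c(1,2,3) the in-place version happens to give the right answer, because multiplying by the leading 1 again changes nothing, which is probably why the problem only showed up with decimals.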

How do I count the number of pattern occurrences, if the pattern includes NA, in R?

I have a string of 0's, 1's and NA's like so:
string<-c(0,1,1,0,1,1,NA,1,1,0,1,1,NA,1,0,
0,1,0,1,1,1,NA,1,0,1,NA,1,NA,1,0,1,0,NA,1)
I'd like to count the number of times the PATTERN "1-NA-1" occurs. In this instance, I would like to get the count 5.
I've tried table(string), and trying to replicate this but nothing seems to work. I would appreciate anyone's help!
# some ugly code, but it seems to work
sum( head(string, -2) == 1 & is.na(head(string[-1],-1))
& string[-1:-2] == 1, na.rm = TRUE)
Something like:
x <- which(is.na(string))
x <- x[!x %in% c(1,length(string))]
length(x[string[x-1] & string[x+1]])
# [1] 5
-- REASONING --
First, we check which values of string are NA with is.na(string). Then we find those indices with which and store them in x.
As @Rick mentions, if the first/last value is NA it would lead to problems in our next step. So, we make sure that those are removed (as it shouldn't count anyway).
Next, we want to find the situation where both string[x-1] and string[x+1] are 1. In other words, 1 & 1. Note that FALSE and TRUE can be evaluated as 0 and 1 respectively. So, if you type 1 == TRUE you will get TRUE. If you type 1 & 1 you will also get TRUE back. So, string[x-1] & string[x+1] will return TRUE when both are 1, and FALSE otherwise. We basically obtain a logical vector, and subset x with that vector to get all positions in x that satisfy our search. Then we use length to determine how many there are.
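One caveat with the & test: it treats any non-zero neighbour as TRUE, and an NA neighbour (for example in 1, NA, NA, 1) yields NA, which then gets counted when subsetting. A minimal sketch of a stricter variant that compares each neighbour to 1 explicitly; the function name count_1_na_1 is made up for illustration.

```r
# Hypothetical helper: count positions where an NA sits between two 1s.
count_1_na_1 <- function(v) {
  n <- length(v)
  if (n < 3) return(0L)
  mid <- 2:(n - 1)                 # positions that have both neighbours
  left_is_1  <- v[mid - 1] %in% 1  # %in% gives FALSE (not NA) for NA
  right_is_1 <- v[mid + 1] %in% 1
  sum(is.na(v[mid]) & left_is_1 & right_is_1)
}

string <- c(0,1,1,0,1,1,NA,1,1,0,1,1,NA,1,0,
            0,1,0,1,1,1,NA,1,0,1,NA,1,NA,1,0,1,0,NA,1)
count_1_na_1(string)
# [1] 5
```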

i not showing up as number in loop

I have a loop that finds the position in the matrix where there is the largest difference between consecutive elements. For example, if thematrix[8] and thematrix[9] have the largest difference between any two consecutive elements, the number given should be 8.
I made the loop in a way that it will ignore comparisons where one of the elements is NaN (because I have some of those in my data). The loop I made looks like this.
thenumber = 0 #will store the difference
for (i in 1:nrow(thematrix) - 1) {
  if (!is.na(thematrix[i]) & !is.na(thematrix[i + 1])) {
    if (abs(thematrix[i] - thematrix[i + 1]) > thenumber) {
      thenumber = i
    }
  }
}
This looks like it should work, but whenever I run it I get:
Error in if (!is.na(thematrix[i]) & !is.na(thematrix[i + 1])) { :
argument is of length zero
I tried this thing but with a random number in the brackets instead of i and it works. For some reason it only doesn't work when I use the i specified in the beginning of the for-loop. It doesn't recognize that i represents a number. Why doesn't R recognize i?
Also, if there's a better way to do this task I'd appreciate it greatly if you could explain it to me
You are pretty close, but when you write i in 1:nrow(thematrix) - 1, R builds 1:nrow(thematrix) first and then subtracts 1 from every element, so the loop starts at i = 0, which is what causes this issue. I would suggest either i in 1:nrow(thematrix) or i in 2:nrow(thematrix) - 1 to start your loop at i = 1. I think your approach is generally pretty intuitive, but one suggestion would be to use the print() function frequently to see how i changes over the course of your loop.
The issue is that the : operator has higher precedence than -; you just need to use parentheses around (nrow(thematrix)-1). For example,
thematrix <- matrix(1:10, nrow = 5)
##
wrong <- 1:nrow(thematrix) - 1
right <- 1:(nrow(thematrix) - 1)
##
R> wrong
#[1] 0 1 2 3 4
R> right
#[1] 1 2 3 4
The error message comes from trying to access the zeroth element of thematrix:
R> thematrix[0]
integer(0)
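Putting the fix together, a corrected version of the loop might look like the sketch below. It assumes a one-column matrix with made-up data, and it tracks the largest difference and its position in two separate variables (largest_diff and position are names invented here), since the original loop stores i in thenumber and then compares differences against it.

```r
# Made-up example data with a NaN in it.
thematrix <- matrix(c(1, 4, NaN, 10, 2, 3), ncol = 1)

largest_diff <- 0            # largest absolute difference seen so far
position <- NA_integer_      # index i of the pair (i, i + 1) that produced it

# seq_len() avoids the precedence trap and yields an empty loop for 1 row.
for (i in seq_len(nrow(thematrix) - 1)) {
  d <- abs(thematrix[i] - thematrix[i + 1])
  # Skip pairs involving NaN/NA; && is the scalar form suited to if().
  if (!is.na(d) && d > largest_diff) {
    largest_diff <- d
    position <- i
  }
}

position
# [1] 4
largest_diff
# [1] 8
```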
The other two answers address your question directly, but I must say this is about the worst possible way to solve this problem in R.
set.seed(1) # for reproducible example
x <- sample(1:10,10) # numbers 1:10 in random order
x
# [1] 3 4 5 7 2 8 9 6 10 1   (from R < 3.6.0; later versions sample differently, so x and the result below will differ)
which.max(abs(diff(x)))
# [1] 9
The diff(...) function calculates sequential differences, and which.max(...) identifies the element number of the maximum value in a vector.
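The question mentions NaN values in the data. which.max() discards NA and NaN values, so the same one-liner should skip any pair that involves a NaN. A small sketch with made-up data (the vector v is only for illustration):

```r
v <- c(1, 4, NaN, 10, 2, 3)

# Absolute consecutive differences: 3, NaN, NaN, 8, 1
d <- abs(diff(v))

# NaN entries are ignored; the largest valid difference is between
# v[4] and v[5], so the position returned is 4.
which.max(d)
# [1] 4
```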

counting matching elements of two vectors but not including repeated elements in the count

I've searched a lot in this forum. However, I didn't find a problem similar to the one I'm facing.
My question is:
I have two vectors
x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8) and z <- c(1,1,2,4,5,5,5)
I need to count how many elements the two vectors have in common, where a repeated value counts only as many times as it appears in both vectors.
The answer for this problem should be 4 because :
There are two number 1, one number 2, and one number 4 in each vector.
Functions like match() don't help, since they match repeated numbers against non-repeated ones. Using unique() would also change the final answer from 4 to 3.
What I came up with was a loop that every time it found one number in the other, it would remove from the list so it won't be counted again.
The loop works fine for this size of this example; however, searching for larger vectors numerous times makes my loop inefficient and too slow for my purposes.
system.time({
  for (n in 1:1000) {
    x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8)
    z <- c(1,1,2,4,5,5,5)
    score <- 0
    for (s in z) {
      if (s %in% x) {
        # drop the first match from x so it is not counted again
        x <- x[-which(x == s)[1]]
        score <- score + 1
      }
    }
  }
})
Can someone suggest a better method?
I've tried using lapply; for short vectors it is faster, but it became slower for longer ones.
Use R's vectorization to your advantage here. There's no looping necessary.
You could use a table to look at the frequencies,
table(z[z %in% x])
#
# 1 2 4
# 2 1 1
And then take the sum of the table for the total
sum(table(z[z %in% x]))
# [1] 4
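Note that z %in% x counts every element of z that occurs anywhere in x, so if z held three 1s and x only two, the sum would include all three. If each match should be used up only once, as in the question's loop, one option is to count each shared value in both vectors and take the smaller count. A minimal sketch, with count_shared as a made-up name:

```r
# Hypothetical helper: size of the multiset intersection of a and b.
count_shared <- function(a, b) {
  vals <- intersect(a, b)                      # values present in both
  # Per-value counts; values outside vals map to NA, which tabulate() ignores.
  count_a <- tabulate(match(a, vals), nbins = length(vals))
  count_b <- tabulate(match(b, vals), nbins = length(vals))
  # Each value can be matched at most min(count in a, count in b) times.
  sum(pmin(count_a, count_b))
}

x <- c(1,1,2,2,3,3,3,4,4,4,6,7,8)
z <- c(1,1,2,4,5,5,5)
count_shared(x, z)
# [1] 4
count_shared(c(1, 1), c(1, 1, 1))
# [1] 2
```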

Running sum on a column conditional on value

I have a vector of binary variables which state whether a product is on promotion in the period. I'm trying to work out how to calculate the duration of each promotion and the duration between promotions.
promo.flag = c(1,1,0,1,0,0,1,1,1,0,1,1,0)
So in other words: if promo.flag is same as previous period then running.total + 1, else running.total is reset to 1
I've tried playing with apply functions and cumsum but can't manage to get the conditional reset of running total working :-(
The output I need is:
promo.flag = c(1,1,0,1,0,0,1,1,1,0,1,1,0)
rolling.sum = c(1,2,1,1,1,2,1,2,3,1,1,2,0)
Can anybody shed any light on how to achieve this in R?
It sounds like you need run length encoding (via the rle command in base R).
unlist(sapply(rle(promo.flag)$lengths,seq))
Gives you a vector 1 2 1 1 1 2 1 2 3 1 1 2 1. Not sure what you're going for with the zero at the end, but I assume it's a terminal condition and easy to change after the fact.
This works because rle() returns a list of two components, one of which is named lengths and holds the length of each run of repeated values. seq(), when fed a single integer, gives you a sequence from 1 to that number. sapply() then calls seq() on each number in rle()$lengths in turn, generating a list of the mini sequences, and unlist() turns that list into a vector.
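A slightly shorter variant uses base R's sequence(), which concatenates the sequences 1:n for each n in its argument, so the sapply()/unlist() step is not needed. A minimal sketch (the intermediate name runs is only for illustration):

```r
promo.flag <- c(1,1,0,1,0,0,1,1,1,0,1,1,0)

# Run-length encode the flags, then count 1..length within each run.
runs <- rle(promo.flag)
rolling.sum <- sequence(runs$lengths)
rolling.sum
# [1] 1 2 1 1 1 2 1 2 3 1 1 2 1
```

Unlike sapply(), which may simplify its result to a matrix when every run has the same length, sequence() always returns a plain vector.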
