Create index for contiguous runs of values - r

I have a vector:
test <-c(1,1,0,2,2,3,4,1,1,0)
test
# [1] 1 1 0 2 2 3 4 1 1 0
I want to construct a grouping variable that indicates when the value changes:
# [1] 1 1 2 3 3 4 5 6 6 7
What is the best way to do this?

Use run length encoding (rle), seq_along and rep
r <- rle(test)
changes <- rep(seq_along(r$lengths), r$lengths)
changes
## [1] 1 1 2 3 3 4 5 6 6 7
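To see why this works, it may help to look at what rle returns before it is expanded again. The following is only a small sketch; the names run_id, chr and chr_id are made up for illustration, and the character vector is an assumed extra example that is not part of the question.

```r
test <- c(1, 1, 0, 2, 2, 3, 4, 1, 1, 0)

# rle() stores each run once: its value and how long it lasts
r <- rle(test)
r$values   # 1 0 2 3 4 1 0
r$lengths  # 2 1 2 1 1 2 1

# Number the runs 1..7, then repeat each number by its run length
run_id <- rep(seq_along(r$lengths), r$lengths)

# rle() also accepts character vectors (hypothetical example data)
chr <- c("a", "a", "b", "a")
r_chr <- rle(chr)
chr_id <- rep(seq_along(r_chr$lengths), r_chr$lengths)
```

Because rle compares neighbouring elements rather than subtracting them, the same two lines should carry over to character data without changes.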

An alternative, which admittedly only works for numeric data:
test <-c(1,1,0,2,2,3,4,1,1,0)
cumsum(c(1L, diff(test) != 0))
# [1] 1 1 2 3 3 4 5 6 6 7
And a convoluted variation that works for any data type:
head(cumsum(c(TRUE, c(tail(test, -1), NA) != test)), -1)
# [1] 1 1 2 3 3 4 5 6 6 7

Related

Transforming a looping factor variable into a sequence of numerics

I have a factor variable with 6 levels, which simplified looks like:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 1 1 1 2 2 2 2... 1 1 1 2 2... (with n = 78)
Note that each number is repeated mostly, but not always, three times.
I need to transform this variable into the following pattern:
1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8...
where the count keeps ascending through each repetition of the 6 levels.
Is there any way / any function that lets me do that?
Sorry for my bad description!
Assuming that you have a numeric vector that represents the simplified version you posted, e.g. x = c(1,1,1,2,2,3,3,3,1,1,2,2), you can use this:
library(dplyr)
cumsum(x != lag(x, default = 0))
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
This compares each value to the previous one and adds 1 whenever they differ (starting from 1).
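The same steps can be spelled out in base R, which makes the intermediate vectors visible. This is a rough sketch, not the dplyr implementation; prev, changed and grp are illustrative names, and padding with 0 mirrors default = 0, so it assumes x never starts with 0.

```r
x <- c(1, 1, 1, 2, 2, 3, 3, 3, 1, 1, 2, 2)

# Shift x right by one position; the 0 stands in for lag()'s default
# (assumption: the first element of x is never 0)
prev <- c(0, x[-length(x)])

# TRUE wherever a value differs from the one before it
changed <- x != prev

# Each TRUE starts a new group, so the running total is the group index
grp <- cumsum(changed)
grp
# [1] 1 1 1 2 2 3 3 3 4 4 5 5
```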
Maybe you can try rle, i.e.,
r <- rle(x)
v <- rep(seq_along(r$values), r$lengths)
Example with dummy data
x = c(1,1,1,2,2,3,3,3,4,4,5,6,1,1,2,2,3,3,3,4,4)
then we can get
> v
[1] 1 1 1 2 2 3 3 3 4 4 5 6 7 7 8 8 9 9
[19] 9 10 10
In base R you can use diff and cumsum:
c(1, cumsum(diff(x)!=0)+1)
# [1] 1 1 2 2 2 3 3 3 4 4 4 4 5 5 5 6 6 6 7 7 7 8 8 8 8
Data:
x <- c(1,1,2,2,2,3,3,3,4,4,4,4,5,5,5,6,6,6,1,1,1,2,2,2,2)

subset.data.frame in R

I have a data frame of raw data:
raw <- data.frame(subj = c(1,1,1,2,2,2,3,3,3,4,4,4),
blah = c(0,0,0,1,1,1,1,0,1,0,0,0))
From it, I want to remove the bad subj.
badsubj <- c(1,4)
trim <- subset.data.frame(raw, subj != badsubj)
But for some reason, not all the badsubj values are removed:
subj blah
2 1 0
4 2 1
5 2 1
6 2 1
7 3 1
8 3 0
9 3 1
11 4 0
What am I doing wrong? Observations 2 and 11 should be excluded because they are members of badsubj.
raw[!raw$subj %in% badsubj, ]
This is a wrong use of !=.
The problem is that subj and badsubj do not have the same length, so badsubj is recycled until both vectors have the same length. Your code then compares the values elementwise, pair by pair, as in the table below.
subj badsubj
1 1 1
2 1 4
3 1 1
4 2 4
5 2 1
6 2 4
7 3 1
8 3 4
9 3 1
10 4 4
11 4 1
12 4 4
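The recycling can be reproduced explicitly, which shows where the surviving rows 2 and 11 come from. This is just a sketch of the comparison described above; recycled and kept_by_neq are names made up for illustration.

```r
raw <- data.frame(subj = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4),
                  blah = c(0, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0, 0))
badsubj <- c(1, 4)

# What != effectively compares against: badsubj repeated to 12 values
recycled <- rep_len(badsubj, nrow(raw))

# Rows that survive the elementwise != comparison
kept_by_neq <- which(raw$subj != recycled)
kept_by_neq
# [1]  2  4  5  6  7  8  9 11

# %in% asks, for each subj value, whether it appears anywhere in badsubj
trim <- raw[!(raw$subj %in% badsubj), ]
```

With %in%, the position of a value inside badsubj no longer matters, so only subjects 2 and 3 remain in trim.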

changing values in dataframe in R based on criteria

I have a data frame that looks like
> mydata
ID Observation X
1 1 3
1 2 3
1 3 3
1 4 3
2 1 4
2 2 4
3 1 8
3 2 8
3 3 8
I have some code that counts the number of observations per ID, determines which IDs have a number of observations that meets a certain criterion (in this case, >=3 observations), and returns a vector with these IDs:
> vals
[1] 1 3
Now I want to manipulate the X values associated with these IDs, e.g. by adding 1 to each value, giving a data frame like this:
> mydata
ID Observation X
1 1 4
1 2 4
1 3 4
1 4 4
2 1 4
2 2 4
3 1 9
3 2 9
3 3 9
I'm pretty new to R and am uncertain how I might do this. It might help to know that X is constant for each ID.
The call mydata$ID %in% vals returns TRUE or FALSE to indicate whether the ID value for each row is in the vals vector. When you add this to the data currently in mydata$X, the TRUE and FALSE are converted to 1 and 0, respectively, yielding the desired result:
mydata$X <- mydata$X + mydata$ID %in% vals
# mydata
# ID Observation X
# 1 1 1 4
# 2 1 2 4
# 3 1 3 4
# 4 1 4 4
# 5 2 1 4
# 6 2 2 4
# 7 3 1 9
# 8 3 2 9
# 9 3 3 9
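If the change is something other than adding 1, indexing the affected rows directly may be easier to adapt. The sketch below is an assumption-laden example: the question does not show how vals is computed, so the table() step is only one plausible way to get it, and sel is an illustrative name.

```r
mydata <- data.frame(ID          = c(1, 1, 1, 1, 2, 2, 3, 3, 3),
                     Observation = c(1, 2, 3, 4, 1, 2, 1, 2, 3),
                     X           = c(3, 3, 3, 3, 4, 4, 8, 8, 8))

# Assumed way to build vals: IDs with at least 3 observations
counts <- table(mydata$ID)
vals <- as.numeric(names(counts)[as.vector(counts) >= 3])

# Logical mask of the rows whose ID is in vals
sel <- mydata$ID %in% vals

# Modify only those rows; any other transformation could go here
mydata$X[sel] <- mydata$X[sel] + 1
```

The right-hand side can be swapped for, say, mydata$X[sel] * 2 without touching the row selection.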

remove i+1th term if reoccurring

Say we have the following data
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
How would one write a function so that, for A, if the same value appears in the i+1th position, the reoccurring row is removed?
Therefore the output should look like
data.frame(c(1,2,3,4,8,6,1,2,3,4), c(1,2,5,1,2,3,5,1,2,3))
My best guess would be to use a for loop, but I have no experience with these.
You can try
data[c(TRUE, data[-1,1]!= data[-nrow(data), 1]),]
Another option, dplyr-esque:
library(dplyr)
dat1 <- data.frame(A=c(1,2,2,2,3,4,8,6,6,1,2,3,4),
B=c(1,2,3,4,5,1,2,3,4,5,1,2,3))
dat1 %>% filter(A != lag(A, default=FALSE))
## A B
## 1 1 1
## 2 2 2
## 3 3 5
## 4 4 1
## 5 8 2
## 6 6 3
## 7 1 5
## 8 2 1
## 9 3 2
## 10 4 3
Using diff, which calculates the pairwise differences with a lag of 1:
data[c( TRUE, diff(data[,1]) != 0), ]
output:
A B
1 1 1
2 2 2
5 3 5
6 4 1
7 8 2
8 6 3
10 1 5
11 2 1
12 3 2
13 4 3
Using rle
A <- c(1,2,2,2,3,4,8,6,6,1,2,3,4)
B <- c(1,2,3,4,5,1,2,3,4,5,1,2,3)
data <- data.frame(A,B)
X <- rle(data$A)
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
View(data[Y, ])
row.names A B
1 1 1 1
2 2 2 2
3 5 3 5
4 6 4 1
5 7 8 2
6 8 6 3
7 10 1 5
8 11 2 1
9 12 3 2
10 13 4 3
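Since the question asks for a function, the neighbour comparison can be wrapped up as one. This is a minimal sketch; drop_repeats and its col argument are hypothetical names, not something from the answers above.

```r
# Hypothetical helper: keep a row only if its value in `col` differs
# from the value in the row directly above it
drop_repeats <- function(df, col) {
  v <- df[[col]]
  n <- length(v)
  if (n < 2) return(df)            # nothing to compare against
  keep <- c(TRUE, v[-1] != v[-n])  # the first row is always kept
  df[keep, , drop = FALSE]
}

data <- data.frame(A = c(1, 2, 2, 2, 3, 4, 8, 6, 6, 1, 2, 3, 4),
                   B = c(1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 1, 2, 3))
result <- drop_repeats(data, "A")
```

Comparing with != rather than diff means the helper should also work when the column holds character values.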

How do I split a vector into a list of vectors when a condition is met?

I would like to split a vector into a list of vectors. The resulting vectors will be of variable length, and I need the split to occur only when certain conditions are met.
Sample data:
set.seed(3)
x <- sample(0:9,100,repl=TRUE)
For example, in this case I would like to split the above vector x at each 0.
Currently I do this with my own function:
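ConditionalSplit <- function(myvec, splitfun) {
  newlist <- list()
  # positions where the split condition holds (use the argument, not a global x)
  splits <- which(splitfun(myvec))
  # no split points: return the whole vector as a single element
  if (length(splits) == 0) return(list(myvec))
  # leading chunk before the first split point
  if (splits[1] != 1) newlist[[1]] <- myvec[1:(splits[1]-1)]
  i <- 1
  imax <- length(splits)
  while (i < imax) {
    curstart <- splits[i]
    curend <- splits[i+1]
    # skip a split point that is immediately followed by another one
    if (curstart != curend - 1)
      newlist <- c(newlist, list(myvec[curstart:(curend-1)]))
    i <- i + 1
  }
  # final chunk runs from the last split point to the end of myvec
  newlist <- c(newlist, list(myvec[splits[i]:length(myvec)]))
  return(newlist)
}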
This function gives the output I'd like, but I'm certain there's a better way than mine.
> MySplit <- function(x) x == 0
> ConditionalSplit(x, MySplit)
[[1]]
[1] 1 8 3 3 6 6 1 2 5 6 5 5 5 5 8 8 1 7 8 2 2
[[2]]
[1] 0 1
[[3]]
[1] 0 2 7 5 9 5 7 3 3 1 4 2 3 8 2 5 2 2 7 1 5 4 2
...
The following line seems to work just fine:
split(x,cumsum(x==0))
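A tiny example shows what cumsum is doing here: every 0 bumps the running total by one, so each 0 and the values after it share a group number. The vector below is made-up data, chosen so the result is easy to check by eye; grp and parts are illustrative names.

```r
x <- c(3, 1, 0, 2, 0, 0, 5)

# x == 0 marks the split points; cumsum turns them into group labels
grp <- cumsum(x == 0)
grp
# [1] 0 0 1 1 2 3 3

# split() collects the values that share a label
parts <- split(x, grp)
parts
# $`0`
# [1] 3 1
# $`1`
# [1] 0 2
# $`2`
# [1] 0
# $`3`
# [1] 0 5
```

Note that two consecutive zeros each start their own group, whereas ConditionalSplit from the question skips a split point that is directly followed by another one.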
Another solution is to use tapply. A good reason to use tapply instead of split is that it lets you perform other operations on the items in the list while you're splitting it.
For example, in this solution to the question:
> x <- sample(0:9,100,repl=TRUE)
> idx <- cumsum(x==0)
> splitList <- tapply(x, idx, function(y) {list(y)})
> splitList
$`0`
[1] 2 9 2
$`1`
[1] 0 5 5 3 8 4
$`2`
[1] 0 2 5 2 6 2 2
$`3`
[1] 0 8 1 7 5
$`4`
[1] 0 1 6 6 3 8 7 2 4 2 3 1
$`5`
[1] 0 6 8 9 9 1 1 2
$`6`
[1] 0 1 2 2 2 7 8 1 9 7 9 3 4 8 4 6 4 5 3 1
$`7`
[1] 0 2 7 8 5
$`8`
[1] 0 3 4 8 4 7 3
$`9`
[1] 0 8 4
$`10`
[1] 0 4 3 9 9 8 7 4 4 5 5 1 1 7 3 9 7 4 4 7 7 6 3 3
This can be modified so that you divide each element by the number of elements in that list, by using
list(y/length(y))
instead of
list(y)
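For comparison, the same per-group scaling can be written with split followed by lapply instead of tapply. This is only a sketch of that swapped-in route on made-up data (the sampled x from the answer is not reproducible without a seed); scaled is an illustrative name.

```r
x <- c(2, 4, 0, 3, 3, 3, 0, 8)
idx <- cumsum(x == 0)   # group labels: 0 0 1 1 1 1 2 2

# Split first, then divide every element by the size of its own group
scaled <- lapply(split(x, idx), function(y) y / length(y))
scaled
# $`0`
# [1] 1 2
# $`1`
# [1] 0.00 0.75 0.75 0.75
# $`2`
# [1] 0 4
```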

Resources