Find unique set of strings in vector where vector elements can be multiple strings - r

I have a series of batch records that are labeled sequentially. Sometimes batches overlap.
x <- c("1","1","1/2","2","3","4","5/4","5")
> data.frame(x)
x
1 1
2 1
3 1/2
4 2
5 3
6 4
7 5/4
8 5
I want to find the set of batches that are not overlapping and label those periods. Batch "1/2" includes both "1" and "2" so it is not unique. When batch = "3" that is not contained in any previous batches, so it starts a new period. I'm having difficulty dealing with the combined batches, otherwise this would be straightforward. The result of this would be:
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
My experience is in more functional programming paradigms, so I know the way I did this is very un-R. I'm looking for the way to do this in R that is clean and simple. Any help is appreciated.
Here's my un-R code that works, but is super clunky and not extensible.
x <- c("1","1","1/2","2","3","4","5/4","5")
p <- 1 #period number
temp <- NULL #temp variable for storing cases of x (batches)
temp[1] <- x[1]
period <- NULL
rl <- 0 #length to repeat period
for (i in 1:length(x)){
#check for "/", split and add to temp
if (grepl("/", x[i])){
z <- strsplit(x[i], "/") #split character
z <- unlist(z) #convert to vector
temp <- c(temp, z, x[i]) #add to temp vector for comparison
}
#check if x in temp
if(x[i] %in% temp){
temp <- append(temp, x[i]) #add to search vector
rl <- rl + 1 #increase length
} else {
period <- append(period, rep(p, rl)) #add to period vector
p <- p + 1 #increase period count
temp <- NULL #reset
rl <- 1 #reset
}
}
#add last batch
rl <- length(x) - length(period)
period <- append(period, rep(p,rl))
df <- data.frame(x,period)
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3

R has functional paradigm influences, so you can solve this with Map and Reduce. Note that this solution follows your approach in unioning seen values. A simpler approach is possible if you assume batch numbers are consecutive, as they are in your example.
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
r<-Reduce(union,s,init=list(),acc=TRUE)
p<-cumsum(Map(function(x,y) length(intersect(x,y))==0,s,r[-length(r)]))
data.frame(x,period=p)
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3
What this does is first calculate a cumulative union of seen values. Then, it maps across this to determine the places where none of the current values have been seen before. (Alternatively, this second step could be included within the reduce, but this would be wordier without support for destructuring.) The cumulative sum provides the "period" numbers based on the number of times the intersections have come up empty.
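The steps above can be traced one at a time. This is only a sketch of the same idea with the intermediate values named; `seen`, `prior` and `is_new` are names chosen here for illustration, and `mapply` stands in for `Map` so that the result is a plain logical vector.

```r
x <- c("1", "1", "1/2", "2", "3", "4", "5/4", "5")
s <- strsplit(x, "/")

# cumulative union of every batch id seen so far, starting from nothing
seen <- Reduce(union, s, accumulate = TRUE, init = character(0))

# seen[[k]] holds the ids seen *before* row k; drop the final entry,
# which already includes the last row
prior <- seen[-length(seen)]

# a row starts a new period when none of its ids appeared earlier
is_new <- mapply(function(cur, prev) length(intersect(cur, prev)) == 0, s, prior)

# each TRUE bumps the period counter by one
period <- cumsum(is_new)
data.frame(x, is_new, period)
```

Printing `is_new` next to `period` makes it easy to check which rows open a new period.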
If you do assume that the batch numbers are consecutive, you can do the following instead:
x <- c("1","1","1/2","2","3","4","5/4","5")
s<-strsplit(x,"/")
n<-mapply(function(x) range(as.numeric(x)),s)
p<-cumsum(c(1,n[1,-1]>n[2,-ncol(n)]))
data.frame(x,period=p)
This gives the same result (not repeated here).

A little bit shorter:
x <- c("1","1","1/2","2","3","4","5/4","5")
x<-data.frame(x=x, period=-1, stringsAsFactors = FALSE)
period=0
prevBatch=-1
for (i in 1:nrow(x))
{
# convert to numeric so batches compare as numbers (10 > 9), not as strings
spl=as.numeric(unlist(strsplit(x$x[i], "/")))
currentBatch=min(spl)
if (currentBatch<prevBatch) { stop("Error in sequence") }
if (currentBatch>prevBatch)
period=period+1;
x$period[i]=period;
prevBatch=max(spl)
}
x

Here's a twist on the original that uses tidyr to split the data into two columns so it's easier to use:
# sample data
x <- c("1","1","1/2","2","3","4","5/4","5")
df <- data.frame(x)
library(tidyr)
# separate x into two columns, with second NA if only one number
df <- separate(df, x, c('x1', 'x2'), sep = '/', remove = FALSE, convert = TRUE)
Now df looks like:
> df
x x1 x2
1 1 1 NA
2 1 1 NA
3 1/2 1 2
4 2 2 NA
5 3 3 NA
6 4 4 NA
7 5/4 5 4
8 5 5 NA
Now the loop can be a lot simpler:
period <- 1
for(i in 1:nrow(df)){
period <- c(period,
# test if either x1 or x2 of row i are in any x1 or x2 above it
ifelse(any(df[i, 2:3] %in% unlist(df[1:(i-1),2:3])),
period[i], # if so, repeat the terminal value
period[i] + 1)) # else append the terminal value + 1
}
# rebuild df with x and period, which loses its extra initializing value here
df <- data.frame(x = df$x, period = period[2:length(period)])
The resulting df:
> df
x period
1 1 1
2 1 1
3 1/2 1
4 2 1
5 3 2
6 4 3
7 5/4 3
8 5 3

Related

New Column that is 1-n for each new factor level

I am looking for advice on how to add a column to a dataframe that for each factor level counts from 1 to n the number of factors in that level. Here is the example I am working with.
collatzRule <- function(m){
if ( m %% 2 == 0) {
return(m/2)
} else {
return(3*m + 1)
}
}
collatz <- function(n, limit = 1000000000) {
# collatz numbers will go in this vector
numbers <- numeric(limit)
# keep count of how many numbers we have made:
counter <- 0
while ( n > 1 & counter < limit) {
# need to make a new number
counter <- counter + 1
# put the current number into the vector
numbers[counter] <- n
# make next Collatz number
n <- collatzRule(n)
}
# find how many Collatz numbers we made:
howMany <- min(counter, limit)
# print them out:
print(numbers[1:howMany])
}
library(magrittr) # provides the %>% pipe used below
datalist = list()
for (i in 2:100) {
# ... make some data
dat <- collatz(i) %>% as.data.frame()
dat$i <- i %>% as.factor() # maybe you want to keep track of which iteration produced it?
datalist[[i]] <- dat # add it to your list
}
big_data = do.call(rbind, datalist)
This produces a data frame that looks like:
. i
1 2 2
2 3 3
3 10 3
4 5 3
5 16 3
6 8 3
I want to add a column that looks like this.
. i x
1 2 2 1
2 3 3 1
3 10 3 2
4 5 3 3
5 16 3 4
6 8 3 5
Any help would be greatly appreciated!
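One possible way to get such a column, sketched on a small hand-typed stand-in for big_data (the column name `value` replaces the `.` column and is an assumption made here for readability): `ave` applies `seq_along` within each level of `i`, so the counter restarts at 1 whenever the level changes.

```r
# hand-built stand-in for the first six rows of big_data
big_data <- data.frame(value = c(2, 3, 10, 5, 16, 8),
                       i = factor(c(2, 3, 3, 3, 3, 3)))

# number the rows 1..n within each level of i
big_data$x <- ave(seq_along(big_data$i), big_data$i, FUN = seq_along)
big_data
```

The same call should work on the full data frame, since it only relies on the grouping column `i`.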

add reversed indices based on indicator

I have a vector like this
v <- c(0,0,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0)
I now want to generate a second vector that counts backwards until it hits a 1, then starts over.
The result here would be
r <- c(6,5,4,3,2,1,8,7,6,5,4,3,2,1,4,3,2,1,0)
The last zero should be kept.
I tried something like this but cannot get it to work:
lv <- c(1, which(v == 1))
res <- c()
for(i in 1:(length(lv)-1)) {
res <- c(res, rev(lv[i]:lv[i+1]))
}
We can use ave, creating groups with cumsum, and count the sequence in reverse within each group. We then reassign 1 to the original positions of the 1s in new_seq.
new_seq <- ave(v, cumsum(v==1), FUN = function(x) rev(seq_along(x))) + 1
new_seq[v == 1] <- 1
new_seq
#[1] 6 5 4 3 2 1 8 7 6 5 4 3 2 1 4 3 2 1 2
Update
To keep everything after the last 1 as it is, we can do:
#Make groups
indx <- cumsum(v==1)
#Create reverse sequential counting in each group
new_seq <- ave(v, indx, FUN = function(x) rev(seq_along(x))) + 1
#Keep everything after last 1 as it is
new_seq[which.max(indx) : length(v)] <- v[which.max(indx) : length(v)]
#Change 1's same as their original position
new_seq[v == 1] <- 1
new_seq
#[1] 6 5 4 3 2 1 8 7 6 5 4 3 2 1 4 3 2 1 0

Avoid Loop in Slicing Operation

I have the following code that I execute using a for loop. Is there a way to accomplish the same without a for loop?
first_list <- c(1,2,3, rep(1,5), rep(2,5), rep(3,5), rep(4,5))
print(first_list)
[1] 1 2 3 1 1 1 1 1 2 2 2 2 2
[14] 3 3 3 3 3 4 4 4 4 4
breaks <- c(rep(1,3), rep(5,4))
values <- vector()
i <- 1
prev <- 1
for (n in breaks){
values[i] <- sum(first_list[prev:sum(breaks[1:i])])
i <- i + 1
prev <- prev + n
}
print(values)
[1] 1 2 3 5 10 15 20
The purpose of the loop is to take the first three elements of a list, then add to that list the sum of the next four sets of 5.
You can use tapply for a grouped operation:
tapply(first_list, rep(1:length(breaks), breaks), sum)
or, preferably, using data.table
library(data.table)
data.table(first_list, id=rep(1:length(breaks), breaks))[, sum(first_list), id]$V1
If you have to perform it on your data as in your original post
setDT(mydata)
mydata[, id:=rep(1:length(breaks), breaks),][, sum(Freq), by=id]
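To see why the tapply call reproduces the loop, it can help to look at the grouping vector on its own. A minimal sketch using the vectors from the question; `group_id` is a name introduced here for illustration.

```r
first_list <- c(1, 2, 3, rep(1, 5), rep(2, 5), rep(3, 5), rep(4, 5))
breaks <- c(rep(1, 3), rep(5, 4))

# one group label per element: groups 1-3 have one element each,
# groups 4-7 have five elements each
group_id <- rep(seq_along(breaks), times = breaks)

# sum within each group; as.vector drops the group names
values <- as.vector(tapply(first_list, group_id, sum))
values
```

Each entry of `breaks` is the size of one slice, so repeating the slice number that many times labels every element with the slice it belongs to.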

Store every value in a sequence except some values

If I do the following to a string of letters:
x <- 'broke'
y <- nchar(x)
z <- sequence(y)
How do I store every value of z that isn't the first, last, or middle value of the sequence?
In this example, if z is 1 2 3 4 5, then the desired output would be 2 4.
In the case of 1 2 3 4, nothing would be stored; in the case of 1 2 3 4 5 6, 2 and 5 would be stored, and so on.
if (length(z) %% 2) {
z[-c(1, ceiling(length(z)/2), length(z))]
} else
z[-c(1, c(1,0) + floor(length(z)/2), length(z))]
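The same logic can be wrapped in a small function so it is easy to try on different lengths. This is a sketch only; the function name `drop_ends_and_middle` is made up here, and it treats an even-length sequence as having two middle values, as in the question.

```r
drop_ends_and_middle <- function(z) {
  n <- length(z)
  # odd length: a single middle position; even length: the two central positions
  middle <- if (n %% 2 == 1) ceiling(n / 2) else n / 2 + c(0, 1)
  # negative indices remove the first, middle and last positions
  z[-c(1, middle, n)]
}

drop_ends_and_middle(sequence(nchar("broke")))  # length 5
drop_ends_and_middle(1:4)                       # length 4
drop_ends_and_middle(1:6)                       # length 6
```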

Efficient way to count the change of values between 2 or more matrix or vectors

I am checking the changes that occur between different datasets. For now I am using a simple loop that gives me the count for each change. The datasets are numeric (sequences of numbers), and I count how many times each change occurs (e.g. 1 changed to 5 XX times):
n=100
tmp1<-sample(1:25, n, replace=T)
tmp2<-sample(1:25, n, replace=T)
values_tmp1=sort(unique(tmp1))
values_tmp2=sort(unique(tmp2))
count=c()
i=1
for (m in 1:length(values_tmp1)){
for (j in 1:length(values_tmp2)){
count[i]=length(which(tmp1==values_tmp1[m] & tmp2==values_tmp2[j]))
i=i+1
}
}
However, my data is much bigger, with n = 2000000, and the loop gets extremely slow.
Can anyone help me improve this calculation?
Like this?
tmp1 <- c(1:5,3)
tmp2 <- c(1,3,3,1,5,3)
aggregate(tmp1,list(tmp1,tmp2),length)
# Group.1 Group.2 x
# 1 1 1 1
# 2 4 1 1
# 3 2 3 1
# 4 3 3 2
# 5 5 5 1
This might be faster for a big dataset:
library(data.table)
DT <- data.table(cbind(tmp1,tmp2),key=c("tmp1","tmp2"))
DT[,.N,by=key(DT)]
# tmp1 tmp2 N
# 1: 1 1 1
# 2: 2 3 1
# 3: 3 3 2
# 4: 4 1 1
# 5: 5 5 1
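Base R's table can also cross-tabulate the two vectors in a single call. The sketch below assumes the same kind of simulated data as in the question (with a seed added here for reproducibility). Unlike the aggregate and data.table results above, it keeps the zero counts, so it lines up with the vector built by the original loop. `counts` and `count_vec` are names introduced here.

```r
set.seed(1)  # seed added here only so the example is reproducible
n <- 100
tmp1 <- sample(1:25, n, replace = TRUE)
tmp2 <- sample(1:25, n, replace = TRUE)

# rows are the observed values of tmp1, columns the observed values of tmp2
counts <- table(tmp1, tmp2)

# the loop varies tmp2 fastest, so transpose before flattening
count_vec <- as.vector(t(counts))
head(count_vec)
```

If only the non-zero combinations are wanted, `as.data.frame(counts)` gives one row per pair with a `Freq` column that can be filtered.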
