Finding duplicate values in R

So, in a string containing multiple 1's, it is possible that the number '1' appears at several positions. What I want is (3)

This is not a complete answer, but some ideas (partly based on comments):
z <- "1101101101"
zz <- as.numeric(strsplit(z,"")[[1]])
Compute autocorrelation function and draw plot: in this case I'm getting the periodicity=3 pretty crudely as the first point at which there is an increase followed by a decrease ...
a1 <- acf(zz)
first.peak <- which(diff(sign(diff(a1$acf[,,1])))==-2)[1]
Now we know the periodicity is 3; create runs of 3 with embed() and analyze their similarities:
ee <- embed(zz,first.peak)
pp <- apply(ee,1,paste,collapse="")
mm <- outer(pp,pp,"==")
aa <- apply(mm[!duplicated(mm),],1,which)
sapply(aa,length) ## 3 3 2 ## number of repeats
sapply(aa,function(x) unique(diff(x))) ## 3 3 3

The following code does exactly what you ask for. Try it with str_groups('1101101101'). It returns a list of 3-vectors. Note that the first triple is (1, 3, 4) because the character at the 10th position is also a 1.
Final version, optimized and without errors
str_groups <- function (s) {
  digits <- as.numeric(strsplit(s, '')[[1]])
  index1 <- which(digits == 1)
  len <- length(digits)
  back <- length(index1)
  if (back == 0) return(list())
  maxpitch <- (len - 1) %/% 2
  patterns <- matrix(0, len, maxpitch)
  result <- list()
  for (pitch in 1:maxpitch) {
    divisors <- which(pitch %% 1:(pitch %/% 2) == 0)
    while (index1[back] > len - 2 * pitch) {
      back <- back - 1
      if (back == 0) return(result)
    }
    for (startpos in index1[1:back]) {
      if (patterns[startpos, pitch] != 0) next
      pos <- seq(startpos, len, pitch)
      if (digits[pos[2]] != 1 || digits[pos[3]] != 1) next
      repeats <- length(pos)
      if (repeats > 3) for (i in 4:repeats) {
        if (digits[pos[i]] != 1) {
          repeats <- i - 1
          break
        }
      }
      continue <- FALSE
      for (subpitch in divisors) {
        sublen <- patterns[startpos, subpitch]
        if (sublen > pitch / subpitch * (repeats - 1)) {
          continue <- TRUE
          break
        }
      }
      if (continue) next
      for (i in 1:repeats) patterns[pos[i], pitch] <- repeats - i + 1
      result <- append(result, list(c(startpos, pitch, repeats)))
    }
  }
  return(result)
}
Note: this algorithm has roughly quadratic runtime complexity, so if you make your strings twice as long, it will take four times as much time to find all patterns on average.
Pseudocode version
This version is meant to aid understanding of the code. For particulars of R functions such as which, consult the R documentation, for example by running ?which at the R prompt.
PROCEDURE str_groups WITH INPUT $s (a string of the form /(0|1)*/):
  digits := array containing the digits in $s
  index1 := positions of the digits in $s that are equal to 1
  len := pointer to last item in $digits
  back := pointer to last item in $index1
  IF there are no items in $index1, EXIT WITH empty list
  maxpitch := the greatest possible interval between 1-digits, given $len
  patterns := array with $len rows and $maxpitch columns, initially all zero
  result := array of triplets, initially empty
  FOR EACH possible $pitch FROM 1 TO $maxpitch:
    divisors := array of divisors of $pitch (including 1, excluding $pitch)
    UPDATE $back TO the last position at which a pattern could start;
      IF no such position remains, EXIT WITH result
    FOR EACH possible $startpos IN $index1 up to $back:
      IF $startpos is marked as part of a pattern, SKIP TO NEXT $startpos
      pos := possible positions of pattern members given $startpos, $pitch
      IF either the 2nd or 3rd $pos is not 1, SKIP TO NEXT $startpos
      repeats := the number of positions in $pos
      IF there are more than 3 positions in $pos THEN
        count how long the pattern continues
        UPDATE $repeats TO the length of the pattern
      END IF (more than 3 positions)
      FOR EACH possible $subpitch IN $divisors:
        check $patterns for pattern with interval $subpitch at $startpos
        IF such a pattern is found AND it envelops the current pattern,
          SKIP TO NEXT $startpos
          (using helper variable $continue to cross two loop levels)
        END IF (pattern found)
      END FOR (subpitch)
      FOR EACH consecutive position IN the pattern:
        UPDATE $patterns at row of position and column of $pitch TO ...
          ... the remaining length of the pattern at that position
      END FOR (position)
      APPEND the triplet ($startpos, $pitch, $repeats) TO $result
    END FOR (startpos)
  END FOR (pitch)
  EXIT WITH $result
END PROCEDURE (str_groups)

Perhaps the following route will help:
Convert the string to a vector of integers:
v <- as.integer(strsplit(s, "")[[1]])
Repeatedly convert this vector to matrices of varying number of rows...
m <- matrix(v, nrow=...)
...and use rle to find relevant patterns in the rows of the matrix m:
rle(m[1, ]); rle(m[2, ]); ...
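As a rough sketch of this route for a single candidate number of rows (the value pitch <- 3, the NA padding and the variable names are my own assumptions, not part of the outline above):

```r
s <- "1101101101"
v <- as.integer(strsplit(s, "")[[1]])

# Hypothetical choice: try one candidate spacing between repeated 1's
pitch <- 3

# Pad with NA so the vector length is a multiple of pitch
padded <- c(v, rep(NA, (-length(v)) %% pitch))

# Filled column by column, so row r holds positions r, r + pitch, r + 2 * pitch, ...
m <- matrix(padded, nrow = pitch)

# Run-length encode each row; a long run of 1's in a row is a repeated pattern
runs <- lapply(seq_len(pitch), function(r) rle(m[r, ]))

runs[[1]]$lengths  # row 1 covers positions 1, 4, 7, 10
runs[[1]]$values
```

In practice this would be repeated for every candidate pitch, and only runs of 1's of sufficient length would be kept.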

Related

Attempting a for loop in R

x <- 1:19
count <- 0
for (i in x) {
if atranspose * T5_5_FBEETLES[i, 3:6]>cutoff
count=count+1
}
print(count)
Hello, I am trying to do a for loop in R. In this for loop, I am multiplying a 1x4 matrix (atranspose in this case) and the third through sixth columns of a table (the table is T5_5_FBEETLES in this case) row by row (hence the i in x, so going through the first 19 rows) and I'm comparing it to a number with the variable name of cutoff. If the multiplication ends up with something greater than the cutoff number, I want count to increase by 1. I know from doing this by hand that by the end count should be 19, but for whatever reason my for loop returns 1 for my count variable and I keep getting these two errors:
unexpected symbol in:
"for (i in x) {
if atranspose"
unexpected '}' in "}"
Can anyone explain to me why these two errors are occurring, and how I can fix up my for loop so that it can return the correct count?
You are getting the first error because the condition of an if statement must be wrapped in parentheses; the second error is a knock-on effect of the first. Since the body of the if sits on its own line, it is also clearer to give it curly braces:
x <- 1:19
count <- 0
for (i in x) {
  if (atranspose * T5_5_FBEETLES[i, 3:6] > cutoff) {
    count <- count + 1
  }
}
print(count)
This will then give you another error because the logical check of the if statement will return a vector, so it needs to be wrapped in an any:
x <- 1:19
count <- 0
for (i in x) {
  if (any(atranspose * T5_5_FBEETLES[i, 3:6] > cutoff)) {
    count <- count + 1
  }
}
print(count)
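If the intended operation is a true matrix product rather than an element-wise one, the count can probably also be computed without a loop. This is only a sketch with made-up stand-in data: the values of atranspose, T5_5_FBEETLES and cutoff below are assumptions for illustration, not the asker's real objects.

```r
# Hypothetical stand-ins for the objects in the question
atranspose <- matrix(c(1, 1, 1, 1), nrow = 1)              # 1 x 4 matrix
T5_5_FBEETLES <- as.data.frame(matrix(1:114, nrow = 19))   # 19 rows, 6 columns
cutoff <- 300

# Vectorised: (19 x 4) %*% (4 x 1) gives one score per row
scores <- as.matrix(T5_5_FBEETLES[, 3:6]) %*% t(atranspose)
count <- sum(scores > cutoff)

# Equivalent explicit loop, one matrix product per row
count_loop <- 0
for (i in 1:19) {
  row_i <- unlist(T5_5_FBEETLES[i, 3:6])
  if (drop(atranspose %*% row_i) > cutoff) {
    count_loop <- count_loop + 1
  }
}

print(count)
print(count_loop)
```

Both versions compare a single number per row with cutoff, so no any() is needed in the condition.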

Delete all rows in a file that fit between certain headers?

I would like to delete all of the rows that sit between certain headers in this example text file.
fileConn <- file("sample.txt")
one <- "*Keyword"
two <- "*Node"
three <- "$ Node,X,Y,Z"
four <- "1,639982.78040607,4733827.5104821,0"
five <- "2,639757.59709573,4733830.43494066,0"
six <- "3,639738.81268144,4733834.3619618,0"
seven <- "*End"
writeLines (c(one, two, three, four, five, six, seven), fileConn)
close(fileConn)
sample <- readLines("sample.txt")
What I am looking to do is delete all of the rows/lines between "*Node" and "*End". Since I am dealing with files with different lengths of rows between these headers, the deletion method needs to be based on headers only. I have no idea how to do this since I've only deleted rows in dataframes referenced by row numbers previously. Any clues?
Expected output is:
*Keyword
*Node
*End
readLines returns a vector, not a data frame, so we can create the sample input more simply:
sample = c("*Keyword",
"*Node",
"$ Node,X,Y,Z",
"1,639982.78040607,4733827.5104821,0",
"2,639757.59709573,4733830.43494066,0",
"3,639738.81268144,4733834.3619618,0",
"*End")
Find the starting and ending headers, and remove the elements in between with negative indexing:
node = which(sample == "*Node")
end = which(sample == "*End")
result = sample[-seq(from = node + 1, to = end - 1)]
result
# [1] "*Keyword" "*Node" "*End"
This assumes there is a single *Node and a single *End line. It also assumes that there is at least one line to delete. You may want to create a more robust solution with some handling for those special cases, e.g.,
delete_between = function(input, start, end) {
  # Look the headers up in the function argument, not in the global `sample`
  start_index = which(input == start)
  end_index = which(input == end)
  if (length(start_index) == 0 || length(end_index) == 0) {
    warning("No start or end found, returning input as-is")
    return(input)
  }
  if (length(start_index) > 1 || length(end_index) > 1) {
    stop("Multiple starts or ends found.")
  }
  if (start_index == end_index - 1) {
    # Nothing between the two headers, so nothing to delete
    return(input)
  }
  return(input[-seq(from = start_index + 1, to = end_index - 1)])
}
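If a file can contain several *Node ... *End blocks, one possible variation is to flag the lines that fall inside any block with cumsum. This is a sketch under the assumption that every start header is followed by a matching end header; the function name delete_between_all is made up for illustration.

```r
delete_between_all <- function(lines, start = "*Node", end = "*End") {
  is_start <- lines == start
  is_end <- lines == end
  # depth is 1 from a start header up to (not including) its end header
  depth <- cumsum(is_start) - cumsum(is_end)
  # Keep the headers themselves; drop only what sits between them
  inside <- depth > 0 & !is_start
  lines[!inside]
}

two_blocks <- c("*Keyword", "*Node", "1,2,3", "*End",
                "*Node", "4,5,6", "7,8,9", "*End")
delete_between_all(two_blocks)
```

Note that with an unmatched start header this sketch would drop everything up to the end of the input, so a check like the one in delete_between above is still worth adding.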

Quicksort in R - array sorted after k steps

I am pretty new to R, so there is definitely some improvement to my code needed. What I want is to do quicksort on an array of n elements, count the number of comparisons made and output the sorted array after k comparisons.
So far, I have reused the code for a quicksort algorithm found here:
quickSort <- function(arr) {
# Pick a number at random.
mid <- sample(arr, 1)
print(arr)
print(mid)
# Place-holders for left and right values.
left <- c()
right <- c()
# Move all the smaller values to the left, bigger values to the right.
lapply(arr[arr != mid], function(d) {
count <<- count + 1
stopifnot(count <= k)
if (d < mid) {
left <<- c(left, d)
}
else {
right <<- c(right, d)
}
})
if (length(left) > 1) {
left <- quickSort(left)
}
if (length(right) > 1) {
right <- quickSort(right)
}
# Finally, return the sorted values.
c(left, mid, right)
}
I am currently struggling with several things:
How can I get not only the partial vector that is currently being sorted but also the full vector?
Did I put the right stopping condition in the right place?
An example of what I want:
given an array (2,4,1,3,5) and the first pivot element 3, after four comparisons I would want the output to be (2,1,3,4,5).
Any help would be greatly appreciated!

Get out of infinite while loop

What is the best way to have a while loop recognize when it is stuck in an infinite loop in R?
Here's my situation:
diff_val = Inf
last_val = 0
while(diff_val > 0.1){
### calculate val from data subset that is greater than the previous iteration's val
val = foo(subset(data, col1 > last_val))
diff_val = abs(val - last_val) ### how much did this change val?
last_val = val ### set last_val for the next iteration
}
The goal is to have val get progressively closer and closer to a stable value, and when val is within 0.1 of the val from the last iteration, then it is deemed sufficiently stable and is released from the while loop. My problem is that with some data sets, val gets stuck alternating back and forth between two values. For example, iterating back and forth between 27.0 and 27.7. Thus, it never stabilizes. How can I break the while loop if this occurs?
I know of break but do not know how to tell the loop when to use it. I imagine holding onto the value from two iterations before would work, but I do not know of a way to keep values two iterations ago...
while(diff_val > 0.1){
val = foo(subset(data, col1 > last_val))
diff_val = abs(val - last_val)
last_val = val
if(val == val_2_iterations_ago) break
}
How can I create val_2_iterations_ago?
Apologies for the non-reproducible code. The real foo() and data that are needed to replicate the situation are not mine to share... they aren't key to figuring out this issue with control flow, though.
I don't know if just keeping track of the previous two iterations will actually suffice, but it isn't too much trouble to add logic for this.
The logic is that at each iteration, the second to last value becomes the last value, the last value becomes the current value, and the current value is derived from foo(). Consider this code:
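val_2_iterations_ago <- NA  # nothing to compare against during the first two iterations
while (diff_val > 0.1) {
  val <- foo(subset(data, col1 > last_val))
  # isTRUE() treats the comparison with the initial NA as "no match"
  if (isTRUE(val == val_2_iterations_ago)) break
  diff_val <- abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
}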
Another approach, perhaps a little more general, would be to track your iterations and set a maximum.
Pairing this with Tim's nice answer:
iter = 0
max_iter = 1e6
val_2_iterations_ago <- NA  # nothing to compare against during the first two iterations
while (diff_val > 0.1 && iter < max_iter) {
  val <- foo(subset(data, col1 > last_val))
  if (isTRUE(val == val_2_iterations_ago)) break
  diff_val = abs(val - last_val)
  val_2_iterations_ago <- last_val
  last_val <- val
  iter = iter + 1
}
How this is generally done is that you have:
A convergence tolerance, so that when your objective function doesn't change appreciably, the algorithm is deemed to have converged
A limit on the number of iterations, so that the code is guaranteed to terminate eventually
A check that the objective function is actually decreasing, to catch the situation where it's diverging/cyclic (many optimisation algorithms are designed so this shouldn't happen, but in your case it does happen)
Pseudocode:
oldVal <- Inf
for(i in 1:NITERS)
{
val <- objective(x)
diffVal <- val - oldVal
converged <- (diffVal <= 0 && abs(diffVal) < TOL)
if(converged || diffVal > 0)
break
oldVal <- val
}
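To make the ideas above concrete, here is a runnable sketch that combines a tolerance, an iteration limit and a cycle check. It is not the asker's real setup: step stands in for the foo(subset(...)) call, and the function name, the cycle_tol parameter and the toy step functions are all assumptions for illustration.

```r
iterate_until_stable <- function(step, start, tol = 0.1,
                                 max_iter = 1000, cycle_tol = 1e-8) {
  history <- start  # every value seen so far, oldest first
  for (i in seq_len(max_iter)) {
    last_val <- history[length(history)]
    val <- step(last_val)
    # Converged: the value barely moved since the previous iteration
    if (abs(val - last_val) < tol) {
      return(list(value = val, status = "converged", iterations = i))
    }
    # Cycle: the value has come back to one that was seen earlier
    if (any(abs(val - history) < cycle_tol)) {
      return(list(value = val, status = "cycle", iterations = i))
    }
    history <- c(history, val)
  }
  list(value = history[length(history)], status = "max_iter",
       iterations = max_iter)
}

# Toy step that flips between 27.0 and 27.7, as described in the question
oscillating <- function(v) if (v == 27.0) 27.7 else 27.0
iterate_until_stable(oscillating, start = 0)$status

# Toy step that moves halfway towards 10 on every call
halving <- function(v) (v + 10) / 2
iterate_until_stable(halving, start = 0)$status
```

Keeping the whole history catches cycles of any length, not only those of length two, at the cost of storing one number per iteration.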

Nested if statement in R

I'm trying to implement the following in R, but I'm new to R and my code doesn't work.
I have a matrix A, on which I did a change of coordinates.
I want to write two functions:
1) give the element of the matrix, given its coordinates
2) give the coordinates, given a number.
The pseudocode is right; the only problem is my syntax. Can somebody correct it?
f<- as.numeric(readline(prompt="Please enter 10 to get coordinate of number,and 20 to get the number > "));
if(p==10){
# give the number, given coordinates
i<- as.numeric(readline(prompt="Pleae enter i cordinate > "));
j<- as.numeric(readline(prompt="Pleae enter j cordinate > "));
if (i>0&j<0) return A[5+i,5+j]
if (i>0&j>0) return A[5+i,5+j]
if (i<0&j>0) return A[5+i,5-j]
if (i<0&j<0) return A[5+i,5-j]
}else if (p==20){
#give the cordinate, given number
coordinate <- which(A==number)
[i,j]<-A[2-coordinate[0],coordinate[1]-2]
}
}
Warning: what if i or j is equal to zero? Next, make a single variable which is the decimal representation of the binary pair (i > 0, j > 0). That is,
if (p == 10) {
  x <- (i > 0) + 2 * (j > 0) + 1
  # x takes on values 1 through 4, because switch() with a numeric selector needs a positive integer
  switch(x,
         return(A[5 + i, 5 + j]),
         return(A[5 + i, 5 + j]),
         return(A[5 + i, 5 + j]),
         return(A[5 + i, 5 + j])) # change the +/- indices as desired
} else {
  # etc.
And, finally, you should make this a function, not a collection of commands.
Edit - I skipped this before, but: you cannot call an index of 0 so you need to fix a number of things in the line [i,j]<-A[2-coordinate[0],coordinate[1]-2]
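As a sketch of what the two functions could look like, assuming the simplest possible mapping in which coordinates (i, j) correspond to A[5 + i, 5 + j] for every sign combination (the function names, the centre argument and the 9 x 9 example matrix are my assumptions; adjust the offsets to your actual change of coordinates):

```r
# Element of A at shifted coordinates (i, j); (0, 0) is the centre cell
get_element <- function(A, i, j, centre = 5) {
  A[centre + i, centre + j]
}

# Shifted coordinates of every cell of A that equals `number`
get_coordinates <- function(A, number, centre = 5) {
  # arr.ind = TRUE returns 1-based (row, col) pairs instead of a single index
  idx <- which(A == number, arr.ind = TRUE)
  cbind(i = idx[, "row"] - centre, j = idx[, "col"] - centre)
}

A <- matrix(1:81, nrow = 9)   # hypothetical example matrix
get_element(A, 1, -2)         # same as A[6, 3]
get_coordinates(A, 24)
```

Because R indexes from 1, which(..., arr.ind = TRUE) never returns a zero index, which avoids the coordinate[0] problem mentioned above.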
The syntax is as follows:
x <- 4
if (x == 1 | x == 2) print("YES")

Resources