Finding and counting repeated occurrences - r

I want to write a function that accepts three arguments (starting position, ending position, length) and counts how many times each distinct pattern of that length appears within that range. I then want to extract the most frequent one. I know this sounds confusing.

Try this:
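countSubstring <- function(string, start, end, len) {
  # start positions of consecutive, non-overlapping chunks of width len
  startChar <- seq(start, end, by = len)
  # cut out each chunk and tabulate how often each one occurs
  table(substring(string, startChar, startChar + (len - 1)))
}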
string<-"aabaaaabaaaacaaaabaaaabaa"
countSubstring(string,start=1,end=15,len=5)
aabaa aacaa
2 1
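The question also asks for the most frequent pattern, which the table alone does not single out. A minimal sketch of that last step, assuming ties should all be kept (the names counts and top are only illustrative):

```r
# Same function as above, repeated so the example runs on its own
countSubstring <- function(string, start, end, len) {
  startChar <- seq(start, end, by = len)
  table(substring(string, startChar, startChar + (len - 1)))
}

string <- "aabaaaabaaaacaaaabaaaabaa"
counts <- countSubstring(string, start = 1, end = 15, len = 5)

# Keep every pattern whose count equals the maximum (ties are preserved)
top <- counts[counts == max(counts)]
names(top)        # "aabaa"
as.integer(top)   # 2
```

If only one pattern is wanted even when there are ties, names(counts)[which.max(counts)] returns the first of them.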

Related

Find the index of the last occurrence of fulfilled criteria in a matrix in r

I have an array (x) in R of size 30x11x10.
x=array(-2:20, c(30,11,10))
Each 'grid' or matrix represents a day of data for a month (30 days represented here). I want to find the index (i,j,k) of when the last occurrence of a number less than 2 occurs. Ideally, I would also like the value returned too. If this was in Matlab, I could just use [i,j,k]=find(x(x<2)) but I don't see an exact equivalent for this in R.
I have looked at 'match' as suggested in other posts here, but it seems to find elements when they are specified, not when a criterion (x<2) is given.
I tried this:
xxx<-match(x,x<2,0)
but it returns a long vector of integers that don't appear to show what I am looking for.
Then I tried:
xxx<-match(x,x[x<2],0)
which looks a bit more promising, but still isn't what I want (to be honest, I'm not sure what the output is indexing).
I think I'm probably asking a foolish question here because if I want 3 indices and the value returned, then I should be assigning them to something preemptively right (which I'm not doing)? Can anyone offer any advice?
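One possible approach, offered as a sketch rather than a definitive answer: which() with arr.ind = TRUE returns one row of (i, j, k) subscripts per element that satisfies a condition. "Last" is assumed here to mean last in R's column-major storage order; the names hits, last_idx, and last_val are made up for the example.

```r
x <- array(-2:20, c(30, 11, 10))

# One row per element with x < 2; columns are the three subscripts
hits <- which(x < 2, arr.ind = TRUE)

# Last matching element in storage order, kept as a 1-row matrix
last_idx <- hits[nrow(hits), , drop = FALSE]

# Indexing an array with a matrix of subscripts returns the value itself
last_val <- x[last_idx]

last_idx   # subscripts 23, 11, 10
last_val   # 1
```

If "last" should instead mean the latest day along one particular dimension, the rows of hits can be sorted by that column before taking the final row.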

Compute nearly equal pattern of a string

Find near-duplicate strings. Hi, I know there are match, unique, and duplicated functions in R, but none of them does what I really need. I have one column in my dataset that I need to go through to check whether the numbers are nearly the same. For instance, the first element compared with the second has a nearly equal pattern, except for the leading '9'. The second compared with the third is nearly equal, except for the last digit of the sequence: one ends with 6 while the other ends with 5. Lastly, the last two numbers are 100% equal. If I used the unique() function, only the last case would be correctly excluded.
I'm wondering if there is a function that can flag nearly equal values, maybe by calculating a percentage of equality, so I can focus my attention on the cases with a high equality rate.
dat <- data.frame(text = c("87775956",
"987775956",
"987775955",
"987481732",
"987481732"))
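One way this could be approached, as a sketch only: base R's adist() (in the utils package) computes pairwise edit distances, which can be rescaled into a rough "percentage of equality". The 0.85 threshold and the names d, similarity, and pairs are assumptions made for illustration, not part of the question.

```r
text <- c("87775956", "987775956", "987775955", "987481732", "987481732")

# Pairwise Levenshtein (edit) distances between all strings
d <- adist(text)

# Scale by the longer string of each pair: 1 = identical, 0 = nothing shared
longest <- outer(nchar(text), nchar(text), pmax)
similarity <- 1 - d / longest

# Flag each pair once (upper triangle) when similarity passes the threshold
pairs <- which(similarity >= 0.85 & upper.tri(similarity), arr.ind = TRUE)
data.frame(first = text[pairs[, 1]],
           second = text[pairs[, 2]],
           similarity = round(similarity[pairs], 3))
```

With this data the flagged pairs are elements 1 and 2, 2 and 3, and 4 and 5. The threshold would need tuning for real data.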

rle command counting changes in vector

n <- length(rle(sign(z)))
z contains 1 and -1. n should be the number of times the sign of z changes.
The code above does not lead to the desired outcome. If I expand the command to
length(rle(sign(z))[[1]])
it works. I don't understand the underlying mechanism of how [[1]] solves the problem?
rle returns a list consisting of two components: lengths, and values. As such, its own length is always 2. By contrast, you want to know the length of either of those components (they obviously have the same length). So either length(rle(…)[[1]]) or length(rle(…)[[2]]) would work. Better to use the names instead of an index though, e.g.
length(rle(z)$lengths)
However, this won’t be the number of times the sign changes; rather, it will be the number of sign changes plus 1.
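A small worked example may make the difference concrete. The vector z below is made up for illustration, and the diff()-based count is an alternative not mentioned above:

```r
z <- c(1, 1, -1, -1, 1, -1)   # made-up data: the sign changes 3 times

runs <- rle(sign(z))
length(runs)           # 2: the list always has the components lengths and values
length(runs$lengths)   # 4: the number of runs

# Number of sign changes = number of runs minus 1
n_changes <- length(runs$lengths) - 1   # 3

# Alternative: count the positions where neighbouring signs differ
n_changes_alt <- sum(diff(sign(z)) != 0)   # 3
```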

HW assignment for learning R from scratch

So I am taking a course that requires learning R and I am struggling with one of the questions:
In this question, you will practice calling one function from within another function. We will estimate the probability of rolling two sixes by simulating dice throws. (The correct probability to four decimal places is 0.0278, or 1 in 36).
(1) Create a function roll.dice() that takes a number ndice and returns the result of rolling ndice number of dice. These are six-sided dice that can return numbers between 1 and 6. For example roll.dice(ndice=2) might return 4 6. Use the sample() function, paying attention to the replace option.
(2) Now create a function prob.sixes() with parameter nsamples, that first sets j equal to 0, and then calls roll.dice() multiple times (nsample number of times). Every time that roll.dice() returns two sixes, add one to j. Then return the probability of throwing two sixes, which is j divided by nsamples.
I am fine with part one, or at least I think so, so this is what I have
roll.dice<-function(ndice)
{
roll<-sample(1:6,ndice,TRUE)
return(roll)
}
roll.dice(ndice=2)
but I am struggling with part two. This is what I have so far:
prob.sixes<-function(nsamples) {
j<-vector
j<-0
roll.dice(nsamples)
if (roll.dice==6) {
j<-j+1
return(j)
}
}
prob.sixes(nsamples=3)
Sorry for all the text, but can anybody help me?
Your code has a couple of problems that I can see. The first one is the interpretation of the question. The question says:
Now create a function prob.sixes() with parameter nsamples, that first sets j equal to 0, and then calls roll.dice() multiple times (nsample number of times).
Check your code: are you doing this, or are you calling roll.dice() a single time? Look for ways to do the same thing (in your case, roll.dice) several times; you may want to consider a for loop. Also, you need to store the result of this function in a variable, something like
rolled = roll.dice(2)
Second problem:
Every time that roll.dice() returns two sixes, add one to j.
You are checking if roll.dice==6. But this has two problems. First, roll.dice is a function, not a variable. So it will never be equal to 6. Also, you don't want to check if this variable is equal to six. You should ask whether this variable is equal to a pair of sixes. How can you write "a pair of sixes"?
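Putting those hints together, one possible shape for the function is sketched below. It assumes that "a pair of sixes" is tested with all(rolled == 6) and that a plain for loop is acceptable; the seed and sample size are arbitrary choices for the demonstration.

```r
roll.dice <- function(ndice) {
  # sample with replacement so the same face can come up on several dice
  sample(1:6, ndice, replace = TRUE)
}

prob.sixes <- function(nsamples) {
  j <- 0
  for (i in seq_len(nsamples)) {
    rolled <- roll.dice(2)               # store the result of each throw
    if (all(rolled == 6)) j <- j + 1     # count only when both dice show six
  }
  j / nsamples                           # estimated probability
}

set.seed(1)               # arbitrary seed, only for reproducibility
p <- prob.sixes(100000)   # should land close to 1/36
```

With only a handful of samples (such as nsamples=3) the estimate will usually be 0, so a large nsamples is needed to get near 0.0278.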

Searching an ordered "list" matching condition when nothing matches the condition, list length = 1

I have a sorted list with 3 columns, and I'm searching to see if the second column matches 2 or 4, then returning the first column's element if so, and putting that into a function.
noOutliers((L1LeanList[order(L1LeanList[,1]),])[(L1LeanList[order(L1LeanList[,1]),2]==2)|
(L1LeanList[order(L1LeanList[,1]),2]==4),1])
When nothing matches the condition, I get:
Error in ((L1LeanList[order(L1LeanList[, 1]), ])[1, ])[(L1LeanList[order(L1LeanList[, :
incorrect number of dimensions
because we effectively have List[List[all false]].
I can't just substitute something like L1LLSorted<-L1LeanList[order(L1LeanList[,1]),]
and use L1LLSorted[,2], since this returns an error when the list is of length exactly 1,
so now my code would need to look like
noOutliers(ifelse(any((L1LeanList[order(L1LeanList[,1]),2]==2)|
(L1LeanList[order(L1LeanList[,1]),2]==4)),0,
(L1LeanList[order(L1LeanList[,1]),])[(L1LeanList[order(L1LeanList[,1]),2]==2)|
(L1LeanList[order(L1LeanList[,1]),2]==4),1])))
which seems a bit ridiculous for the simple thing I'm requesting.
While writing this, I realized that I can put all this error checking into the noOutliers function itself, so that the call looks like
noOutliers(L1LeanList,2,2,4), which will look much better. That is a necessity, since slightly varying versions of this appear in my code dozens of times. Still, I can't help but wonder if there's a more elegant way to write the actual function.
For the curious, noOutliers finds the mean of the 30th-70th percentile in the sorted data set, like so:
noOutliers <- function(oList) {
  if (length(oList) <= 20) return("insufficient data")
  cumSum <- 0
  iterCount <- 0
  # Adjustments deal with R's .5 -> even number rounding and 1-based indexing
  # (ex. for a list 1-10, taking 3-7 cuts off 1,2,8,9,10, imbalanced.)
  for (i in round(length(oList) * 3/10 - .000001):round(length(oList) * 7/10 + .000001) + 1) {
    cumSum <- cumSum + oList[i]
    iterCount <- iterCount + 1
  }
  return(cumSum / iterCount)
}
Let's see...
foo <- bar[(bar[,2]==2 | bar[,2]==4),1]
should extract all the first-column values you want. Then run whatever function you want on foo perhaps with the caveat "if (length(foo) < 1) then {exit, or skip, or something} "
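A sketch of how this could be wrapped up so that the one-row and no-match cases do not fail. The matrix bar and the helper pick_values are hypothetical stand-ins for the asker's data and code; the key assumption is that the data is a matrix or data frame, in which case drop = FALSE stops a single row from collapsing into a plain vector.

```r
# Hypothetical stand-in data: column 1 holds the values, column 2 the codes
bar <- cbind(value = c(5, 3, 9, 1), code = c(2, 7, 4, 1))

pick_values <- function(m, codes) {
  # Sort by the first column; drop = FALSE keeps a 1-row input two-dimensional
  m <- m[order(m[, 1]), , drop = FALSE]
  # First-column values of the rows whose second column is one of the codes
  m[m[, 2] %in% codes, 1]
}

foo <- pick_values(bar, c(2, 4))              # 5 9
none <- pick_values(bar, 99)                  # zero-length vector, no error
one <- pick_values(bar[1, , drop = FALSE], 2) # works for a single row: 5

if (length(foo) < 1) {
  message("nothing matched, skipping")
} else {
  mean(foo)   # placeholder for whatever function should run on the matches
}
```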

Resources