Determine when a sequence of numbers has been broken in R

Say I have a series of numbers:
seq1<-c(1:20,25:40,48:60)
How can I return vectors that list the points at which the sequence was broken, like so:
c(21,24)
[1] 21 24
c(41,47)
[1] 41 47
Thanks for any help.
To show my miserably failing attempt:
nums<-min(seq1):max(seq1) %in% seq1
which(nums==F)[1]
res.vec<-vector()
counter<-0
res.vec2<-vector()
counter2<-0
for (i in 2:length(seq1)){
if(nums[i]==F & nums[i-1]!=F){
counter<-counter+1
res.vec[counter]<-seq1[i]
}
if(nums[i]==T & nums[i-1]!=T){
counter2<-counter2+1
res.vec2[counter2]<-seq1[i]
}
}
cbind(res.vec,res.vec2)

I have changed the general function a bit so I think this should be a separate answer.
You could try
seq1<-c(1:20,25:40,48:60)
myfun<-function(data,threshold){
cut<-which(c(1,diff(data))>threshold)
return(cut)
}
You get the points you have to care about using
myfun(seq1,1)
[1] 21 37
To make this easier to use, it is convenient to store the result in an object.
pru<-myfun(seq1,1)
So you can now call
df<-data.frame(pos=pru,value=seq1[pru])
df
  pos value
1  21    25
2  37    48
You get a data frame with the position and the value of each break for your chosen threshold. If you want a list instead of a data frame, it works like this:
list(pos=pru,value=seq1[pru])
$pos
[1] 21 37
$value
[1] 25 48
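The question asks for the endpoints of each gap (21 and 24, 41 and 47) rather than the position where the sequence resumes. A minimal sketch of one way to get them from the same diff idea, assuming the input is sorted and a step of 1 is expected; the names gap_idx and gaps are made up for illustration:

```r
seq1 <- c(1:20, 25:40, 48:60)

# Index of the last element before each gap (assumes sorted input, step 1)
gap_idx <- which(diff(seq1) > 1)

# First and last missing value of each gap
gaps <- data.frame(start = seq1[gap_idx] + 1,
                   end   = seq1[gap_idx + 1] - 1)
gaps
#   start end
# 1    21  24
# 2    41  47
```

Each row of gaps corresponds to one of the c(21,24) style vectors requested in the question.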

The diff function gives you the differences between successive values:
> x <- c(1,2,3,5,6,3)
> diff(x)
[1] 1 1 2 1 -3
Now look for those values that are not equal to one for "breakpoints" in your sequence.
Taking into account the comments made here, for general use you could write:
fun<-function(data,threshold){
idx<-which(c(threshold,diff(data)) != threshold) # pad with threshold so the first element is never flagged
return(idx)
}
Note that data can be any numeric vector (such as a data frame column). I would also consider using grep with a similar approach, but it all depends on user preference.
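The same diff test can also be used to split the vector into its unbroken runs, which may be easier to work with than bare positions. A small sketch, assuming a step of 1 counts as unbroken; run_id and runs are illustrative names:

```r
seq1 <- c(1:20, 25:40, 48:60)

# A new run starts at element 1 and wherever the step is not 1
run_id <- cumsum(c(TRUE, diff(seq1) != 1))

# One list element per unbroken run
runs <- split(seq1, run_id)

# First and last value of each run (here: 1-20, 25-40, 48-60)
t(sapply(runs, range))
```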

Related

How to apply a function in different ranges of a vector in R?

I have the following matrix:
x=matrix(c(1,2,2,1,10,10,20,21,30,31,40,
1,3,2,3,10,11,20,20,32,31,40,
0,1,0,1,0,1,0,1,1,0,0),11,3)
I would like to find for each unique value of the first column in x, the maximum value (across all records having that value of the first column in x) of the third column in x.
I have created the following code:
v1 <- sequence(rle(x[,1])$lengths)
A=split(seq_along(v1), cumsum(v1==1))
A_diff=rep(0,length(split(seq_along(v1), cumsum(v1==1))))
for( i in 1:length(split(seq_along(v1), cumsum(v1==1))) )
{
A_diff[i]=max(x[split(seq_along(v1), cumsum(v1==1))[[i]],3])
}
However, this code only works when equal elements are consecutive in the first column (because I use rle), and it relies on a for loop.
So, how can I make it work in general, and without the for loop, that is, using a function?
If I understand correctly
> tapply(x[,3],x[,1],max)
 1  2 10 20 21 30 31 40
 1  1  1  0  1  1  0  0
To group by more than one variable I would use aggregate. Note that matrices are cumbersome for this purpose, so I would suggest converting yours to a data frame; nonetheless, this works:
> aggregate(x[,3],list(x[,1],x[,2]),max)
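As a sketch of the data frame route suggested above, the formula interface of aggregate keeps the grouping readable. The column names g1, g2 and value are invented for this example:

```r
x <- matrix(c(1, 2, 2, 1, 10, 10, 20, 21, 30, 31, 40,
              1, 3, 2, 3, 10, 11, 20, 20, 32, 31, 40,
              0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0), 11, 3)

# Column names g1, g2, value are made up for this example
df <- data.frame(g1 = x[, 1], g2 = x[, 2], value = x[, 3])

# Maximum of value for each unique g1 (same groups as the tapply call)
by_g1 <- aggregate(value ~ g1, data = df, FUN = max)

# Maximum of value for each (g1, g2) combination that occurs in the data
by_g1_g2 <- aggregate(value ~ g1 + g2, data = df, FUN = max)
```

Neither call depends on equal values being adjacent in the first column, which was the limitation of the rle approach.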

Why the for loop is not using the 'i' specified in the function

I have a data frame with 25 weeks of observations per animal and 20 animals in total. I am trying to write a function that calculates a linear equation between two consecutive points, and to do that for all 25 weeks and all 20 animals.
I want to use a general form of the equation so I can calculate values at any point. In the function, Week=t, Weight=d.
I can't figure out how to make this work. I don't think the loop is using each row of the data frame as the index for the function. My data frame, named growth, looks something like this:
Week Weight Animal
1 50 1
2 60 1
n=25
1 80 2
2 90 2
.
.
20
for (i in growth$Week){
eq<- function(t){
d = growth$BW.Kg
t = growth$Week
(d[i+1]-d[i])/(t[i+1]-t[i])*(t-t[i])+d[i]
return(eq)
}
}
eq(3)
OK, so I think there are a few points of confusion here. The first is writing a function inside a for loop: you are redefining the function on every iteration, and the function doesn't save the values of your equation anywhere. Secondly, you are passing t as your argument but then expecting t to follow the for loop with the i value. Finally, you say that you want this to be done for each animal, but the animal value is not shown in your code.
So it's a little bit hard to see what you are trying to achieve here.
Based on your information above, I've rewritten your function into something that will provide a result for your equation.
library(tidyverse)
growth <- tibble(week = 1:5,
animal = 1,
weight = c(50,52,55,54,57))
eq <- function(d,t,i){
z <- (d[i+1]-d[i])/(t[i+1]-t[i])*(t-t[i])+d[i]
return(z)
}
test_result <- eq(growth$weight,growth$week,3)
Results:
[1] 57 56 55 54 53
Is that the kind of result you were expecting? Or did you want just a single result per week per animal? Could you provide a working example of a formula that would produce a single desired result (i.e. a result for animal 1 on week 1)?
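If the goal is a straight-line estimate of weight between consecutive weekly measurements, base R's approxfun may already cover it. This is only a sketch under that assumption; the second animal's weights below are invented for illustration, and interp is a made-up name:

```r
# Animal 1 reuses the weights above; animal 2 is invented example data
growth <- data.frame(week   = rep(1:5, times = 2),
                     animal = rep(1:2, each = 5),
                     weight = c(50, 52, 55, 54, 57,
                                80, 90, 95, 99, 104))

# One linear interpolation function per animal
interp <- lapply(split(growth, growth$animal),
                 function(g) approxfun(g$week, g$weight))

interp[["1"]](3.5)  # halfway between week 3 (55) and week 4 (54): 54.5
interp[["2"]](1.5)  # halfway between week 1 (80) and week 2 (90): 85
```

By default approxfun returns NA for weeks outside the observed range, so extrapolation would need a different choice.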

R Searching for elements and their index in an array

I have a matrix with 2 columns as described below:
TIME PRICE
10 45
11 89
13 89
15 12
16 09
17 34
19 89
20 90
23 21
26 09
In the above matrix, I need to iterate through the TIME column in steps of 5 seconds and access the PRICE in the matching row.
For example, I start with 10 and need to access 15 (10+5). I could get to 15 easily if the TIME column were continuous, but it's not. So at 15 seconds I need the corresponding price, and this goes on until the end of the data set: the next element to access is 20 and its price, then I add 5 seconds again, and so on. In case the element is not present, the one immediately greater than it must be used to obtain the price.
If the rows you want to extract are m[1,1]+5, m[1,1]+10, m[1,1]+15 etc then:
m <- cbind(TIME=c(10,11,13,15,16,17,19,20,23,26),
PRICE=c(45,89,89,12,9,34,89,90,21,9))
r <- range(m[,1]) # 10,26
r <- seq(r[1]+5, r[2], 5) # 15,20,25
r <- findInterval(r-1, m[,1])+1 # 4,8,10 (values 15,20,26)
m[r,2] # 12,90,9
findInterval finds the index of the last value that is less than or equal to the given value, so I give it a smaller value and then add 1 to the index.
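The r-1 trick relies on the times being whole numbers. A sketch of a more explicit variant that looks up, for each target time, the first row whose TIME is at or after it; targets and idx are illustrative names, and TIME is assumed to be sorted in increasing order:

```r
m <- cbind(TIME  = c(10, 11, 13, 15, 16, 17, 19, 20, 23, 26),
           PRICE = c(45, 89, 89, 12,  9, 34, 89, 90, 21,  9))

# Target times: start + 5, start + 10, ... up to the last observed time
targets <- seq(m[1, "TIME"] + 5, max(m[, "TIME"]), by = 5)  # 15, 20, 25

# For each target, index of the first row with TIME >= target
idx <- vapply(targets,
              function(tt) which(m[, "TIME"] >= tt)[1],
              integer(1))

m[idx, "PRICE"]  # 12, 90, 9 (times 15, 20, 26)
```

This does one comparison pass per target, so for very long series the findInterval version is likely to be faster.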
Breaking the question apart into sub-pieces...
Getting the row with value 15:
Call your Matrix, say, DATA, and
[1] extract the row of interest:
DATA[DATA[,1] == 15, ]
Then snag the second column.
[2] Adding 5 to the first column ( I'm pretty sure you can just do this ):
DATA[,1] = DATA[,1] + 5
This should get you started. The rest seems to just be some funky iteration, incrementing by 5, using [1] to get the price you want each time, swapping 15 for some variable.
I leave the rest of the solution as an exercise to the reader. For tips on looping in R, and more, see the tutorial below (I don't expect it to be taken down any time soon, but you may want to keep a local copy. Good luck :) )
http://www.stat.berkeley.edu/users/vigre/undergrad/reports/VIGRERintro.pdf
As @Tommy commented above, it is not clear exactly which TIME values you want. To me, it seems like you want the PRICE for the sequence 10,15,20,25,... If so, you could do that easily using the modulo operator (%%), although this only picks up times that are exact multiples of 5:
TIME <- c(10,11,13,15,16,17,19,20,23,26) # Your times
PRICE <- c(45,89,89,12,9,34,89,90,21,9) # your prices
PRICE[TIME %% 5 == 0] # Get prices from times in sequence 10, 15, 20, ...

Using Plyr in R with a complex function that returns multiple variable

I have a data set with three grouping variables: condition, sub, & delay. Here is a simplified version of my data (real data is much longer)
sub condition delay later_value choiceRT later_choice primeRT cue
10  SIZE      10    27          1832     1            888     CHILD
10  PAST      5     11          298      0            1635    PANTS
10  SIZE      21    13          456      0            949     CANDY
11  SIZE      120   22          526      1            7963    BOY
11  FUTURE    120   27          561      1            4389    CHILDREN
11  PAST      5     13          561      1            2586    SPRING
I have a complicated set of procedures to apply to these data (details are not important)
I wrote the following function that accomplishes what I want when split by the three grouping variables. It returns 3 variables that I am interested in (indiff, p_intercept, & p_lv)
getIndiffs <- function(currdelay){
if (mean(currdelay$later_choice) == 1) {
indiff = 10.5
p_intercept = "laters"
p_lv = "laters"
}
else if (mean(currdelay$later_choice) == 0) {
indiff = 30.5
# no p-val here, code that this was not calculated
p_intercept = "nows"
p_lv = "nows"
}
else {
F <- factor(currdelay$later_choice)
fit <- glm(F~later_value,data=currdelay,family=binomial())
indiff <- -coef(fit)[1]/coef(fit)[2]
if (indiff < 10) indiff = 10.5
else if (indiff > 30) indiff = 30.5
p_intercept = round(summary(fit)$coef[, "Pr(>|z|)"][1],3)
p_lv = round(summary(fit)$coef[, "Pr(>|z|)"][2], 3)
c(indiff,p_intercept,p_lv)
}
I am trying to use ddply to apply it to each subset of the data per the 3 grouping variables:
ddply(data,.(sub,condition,delay),getIndiffs)
However, when I run this I get the error
Error in list_to_dataframe(res, attr(.data, "split_labels")) :
Results do not have equal lengths
Strangely, this works fine when I use only 1 grouping variable but throws the error with 2 or more.
Also, when I "simulate" the split myself by building a data frame that contains only one subset defined by the 3 grouping variables, my function works just fine. (Note: I've tried different ways of returning 3 variables, and even returning just 1 variable, and it does not work either.)
Basically, what I want to know is how to use plyr to use a function to return multiple variables.
Any other solutions to my problem that are fundamentally different are also welcome.
That error usually happens to me when my function, applied to one of the pieces, returns an empty data frame. In any case, an easy way to debug the situation is to use dlply instead of ddply and examine the output; for instance
x <- dlply(data,.(sub,condition,delay),getIndiffs)
sapply(x,length)
to check that every piece has the same length (for data frames, the same number of columns). If not, standardize your function more.
It looks like your function getIndiffs is designed to run on a single row, not on a whole dataframe. d*ply(x,vars,fn) hands fn() an entire data frame consisting of the subset of observations matching that group. Hm, also, the function can return in three different places -- at the end of each conditional clause. I think you meant to put c(indiff,p_intercept,p_lv) after the last } (and end your function with another }).
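A sketch of what a standardized version might look like: every branch falls through to a single return of a one-row data frame, so all groups have the same shape. The status column is an addition for this sketch (it replaces the "laters"/"nows" strings the original stored in the p-value variables), the example data are invented, and base R split/lapply stands in for ddply so the block runs without plyr. It assumes later_value varies within each mixed group.

```r
getIndiffs <- function(currdelay) {
  p_later <- mean(currdelay$later_choice)
  # Defaults used when no model is fitted
  indiff <- NA_real_; p_intercept <- NA_real_; p_lv <- NA_real_
  if (p_later == 1) {
    indiff <- 10.5
    status <- "laters"
  } else if (p_later == 0) {
    indiff <- 30.5
    status <- "nows"
  } else {
    fit <- glm(later_choice ~ later_value, data = currdelay,
               family = binomial())
    indiff <- unname(-coef(fit)[1] / coef(fit)[2])
    if (indiff < 10) indiff <- 10.5 else if (indiff > 30) indiff <- 30.5
    pvals <- coef(summary(fit))[, "Pr(>|z|)"]
    p_intercept <- round(unname(pvals[1]), 3)
    p_lv <- round(unname(pvals[2]), 3)
    status <- "fitted"
  }
  # Single exit point: same columns and types for every group
  data.frame(indiff = indiff, p_intercept = p_intercept,
             p_lv = p_lv, status = status)
}

# Invented example data: one all-later group, one all-now group, one mixed
dat <- data.frame(
  sub          = 10,
  condition    = rep(c("SIZE", "PAST", "FUTURE"), times = c(3, 3, 8)),
  delay        = rep(c(10, 5, 21), times = c(3, 3, 8)),
  later_value  = c(11, 13, 27, 11, 13, 27, 10, 12, 14, 16, 18, 20, 22, 24),
  later_choice = c(1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1)
)

# Base R stand-in for ddply(dat, .(sub, condition, delay), getIndiffs)
pieces <- split(dat, list(dat$sub, dat$condition, dat$delay), drop = TRUE)
res <- do.call(rbind, lapply(pieces, getIndiffs))
res
```

Keeping the p-values numeric and moving the labels to their own column avoids coercing the whole result to character.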

aaply fails on a vector

I am trying to understand how to use the excellent plyr package's commands on a vector (in my case, of strings). I suppose I'd want to use aaply, but it fails, asking for a margin. But there aren't columns or rows in my vector!
To be a bit more concrete, the following command works, but returns results in a weird list. states.df is a data frame, and region is the name of the state (returned using Hadley's map_data("state") command). Thus, states.df$region is a vector of strings (specifically, state names). opinion.new is a vector of numbers, named using state names.
states.df <- map_data("state")
ch = sapply(states.df$region, function (x) { opinion.new[names(opinion.new)==x] } )
What I'd like to do is:
ch = aaply(states.df$region, function (x) { opinion.new[names(opinion.new)==x] } )
Where ch is the vector of numbers looked up or pulled from opinion.new. But aaply requires an array, and fails on a vector.
Thanks!
If you want to use plyr on a vector, you have to use l*ply, as follows:
v <- 1:10
sapply(v, function(x)x^2)
[1] 1 4 9 16 25 36 49 64 81 100
laply(v, function(x)x^2)
[1] 1 4 9 16 25 36 49 64 81 100
In other words, sapply and laply give the same result here.
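If the goal is only to look up one value per state name, indexing the named vector directly may avoid the apply family altogether. A sketch with invented values standing in for opinion.new and states.df$region:

```r
# Invented stand-ins for opinion.new and states.df$region
opinion.new <- c(alabama = 0.42, arizona = 0.55, texas = 0.48)
region <- c("alabama", "alabama", "texas", "arizona")

# Index a named vector by a character vector: one value per region entry
ch <- unname(opinion.new[region])
ch  # 0.42 0.42 0.48 0.55
```

Names that are missing from opinion.new come back as NA, which keeps ch the same length as region.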
