Analysis of several behaviours of 2 individuals in R
I am analysing several animal behaviours during a defined time period.
I watch videos of the animals and their behaviours. I record when each behaviour is displayed. They will display each behaviour several times during the recording (which correspond to the different events). Sometimes 2 or 3 behaviours can be displayed at the same time during the recording, but they don't usually start/finish exactly at the same time (so they overlap partly).
I end up with a series of events for each behaviour, and for each event I have its onset, duration and end point (see example hereafter).
I need to extract from this data the total amount of time during which behaviour 1 overlaps with behaviour 2, behaviour 1 overlaps with behaviour 3, and behaviour 2 overlaps with behaviour 3. This is so that I can find correlations between behaviours: which ones tend to be displayed at the same time, which ones do not, etc.
I am only a beginner with programming (mostly R) and I find it hard to get started. Can you please advise me how to proceed? Many thanks!
Example with a series of events for 3 behaviours:
Event tracked Onset Duration End
Behaviour 1_event 1 7.40 548.88 556.28
Behaviour 1_event 2 36.20 0.47 36.67
Behaviour 1_event 3 48.45 0.25 48.70
Behaviour 1_event 4 68.92 1.53 70.45
Behaviour 1_event 5 75.48 0.22 75.70
Behaviour 1_event 6 89.75 0.66 90.41
Behaviour 1_event 7 94.62 0.16 94.78
Behaviour 1_event 8 101.78 0.22 102.00
Behaviour 1_event 9 108.86 0.59 109.45
Behaviour 1_event 10 146.35 0.66 147.00
Behaviour 1_event 11 150.20 0.75 150.95
Behaviour 1_event 12 152.98 0.66 153.64
Behaviour 1_event 13 157.84 0.56 158.41
Behaviour 2_event 1 7.52 0.38 7.90
Behaviour 2_event 2 18.73 0.16 18.88
Behaviour 2_event 3 19.95 2.25 22.20
Behaviour 2_event 4 26.41 0.25 26.66
Behaviour 2_event 5 35.91 0.16 36.07
Behaviour 2_event 6 37.29 0.34 37.63
Behaviour 2_event 7 38.13 0.72 38.85
Behaviour 2_event 8 40.19 0.31 40.51
Behaviour 2_event 9 44.26 0.16 44.41
Behaviour 2_event 10 45.32 0.16 45.48
Behaviour 2_event 11 54.84 1.44 56.27
Behaviour 2_event 12 56.65 1.19 57.84
Behaviour 2_event 13 61.59 1.03 62.62
Behaviour 2_event 14 81.13 3.83 84.96
Behaviour 2_event 15 86.65 0.31 86.96
Behaviour 2_event 16 90.15 0.19 90.34
Behaviour 2_event 17 96.97 0.53 97.50
Behaviour 2_event 18 107.12 0.22 107.34
Behaviour 2_event 19 118.53 0.41 118.94
Behaviour 2_event 20 127.76 0.25 128.01
Behaviour 2_event 21 129.45 0.69 130.13
Behaviour 2_event 22 130.60 2.31 132.91
Behaviour 2_event 23 141.01 0.41 141.41
Behaviour 2_event 24 152.85 0.37 153.23
Behaviour 2_event 25 156.54 0.13 156.66
Behaviour 3_event 1 7.71 1.94 9.65
Behaviour 3_event 2 11.12 1.53 12.65
Behaviour 3_event 3 19.01 0.19 19.20
Behaviour 3_event 4 20.01 3.97 23.98
Behaviour 3_event 5 24.95 4.22 29.16
Behaviour 3_event 6 29.70 2.19 31.88
Behaviour 3_event 7 33.23 2.50 35.73
Behaviour 3_event 8 36.82 0.44 37.26
Behaviour 3_event 9 38.20 1.16 39.35
Behaviour 3_event 10 39.91 2.13 42.04
Behaviour 3_event 11 42.49 3.62 46.11
Behaviour 3_event 12 47.09 0.53 47.62
Behaviour 3_event 13 48.15 0.34 48.49
Behaviour 3_event 14 49.40 2.13 51.52
Behaviour 3_event 15 57.57 2.25 59.82
Behaviour 3_event 16 60.89 0.88 61.76
Behaviour 3_event 17 66.85 6.78 73.63
Behaviour 3_event 18 75.65 3.03 78.68
In order to do the kind of study you want, it might be easiest to convert the data to a time series with state variables (i.e. whether behavior 1, 2, 3, etc. is being displayed at each time point). So you want to transform the dataset you have into one that looks like:
time animal behav_1 behav_2 behav_3
0 1 FALSE TRUE FALSE
0 2 TRUE FALSE FALSE
1 1 FALSE TRUE FALSE
1 2 TRUE FALSE TRUE
... ... ... ... ...
Each row tells whether a particular animal is displaying each of the three behaviors at the given time. (I am assuming here that you have multiple animals and you want to keep their behavior data separate.)
Then you can easily approximate many of the quantities you are interested in. For example, you could compute the probability that an animal is doing behavior 1 given that it is doing behavior 2 by:

1. computing a combined column: data$behav_1_and_2 <- data$behav_1 & data$behav_2
2. dividing the sum of behav_1_and_2 by the sum of behav_2: sum(data$behav_1_and_2) / sum(data$behav_2)
Okay, but how do you transform the data? First, decide how finely you want to sample time; an increment of about 0.1 is a reasonable starting point.
num_animals <- 10  # how many animals you have
time_seq <- seq(from = 0, to = 600, by = 0.1)  # `to` should be the end of the video
data <- expand.grid(time = time_seq, animal = 1:num_animals)  # note 1:num_animals, so every animal gets a row at every time
That gets you the first two columns of the data frame you want. Then you need to compute the three behavior columns. Define a function that takes the time, animal, and name of the behavior column, and returns TRUE if the animal is doing that behavior at the time, or FALSE if not.
has_behavior <- function(time, animal, behavior) {
...
}
(I'll let you figure out how to make that function; there's a possible sketch after the loop below if you get stuck.) With that function in hand, you can then create the last three columns with a loop:
# First create empty columns
data$behav_1 <- logical(nrow(data))
data$behav_2 <- logical(nrow(data))
data$behav_3 <- logical(nrow(data))

# Now loop through rows
for (i in 1:nrow(data)) {
  data$behav_1[i] <- has_behavior(data$time[i], data$animal[i], 1)
  data$behav_2[i] <- has_behavior(data$time[i], data$animal[i], 2)
  data$behav_3[i] <- has_behavior(data$time[i], data$animal[i], 3)
}
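If you get stuck writing has_behavior, here is one minimal sketch. It assumes your events table is stored in a data frame called events with columns animal, behaviour (1, 2 or 3), onset and end; those names are assumptions, so adjust them to match your data:

has_behavior <- function(time, animal, behavior) {
  # rows of the (assumed) events table for this animal and behaviour
  rows <- events$animal == animal & events$behaviour == behavior
  # TRUE if any of those event intervals contains the given time point
  any(events$onset[rows] <= time & time <= events$end[rows])
}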
With data in this format, you should be able to study the problem much more easily. You can compute those summary quantities easily, as I outlined earlier. This data frame is also set up to be useful for doing time series modeling. And it's also tidy, making it easy to use with packages like dplyr for data summarising and ggplot2 for visualization. (You can learn more about those last two tools in the free online book R for Data Science by Hadley Wickham.)
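A data frame like this also gives you the overlap totals you originally asked about: with a grid step of 0.1 s, counting the time points at which two behaviours are both TRUE and multiplying by the step size approximates the total overlap duration. For example:

# approximate total time (in seconds) during which behaviours 1 and 2 overlap
sum(data$behav_1 & data$behav_2) * 0.1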
Related
Is there a way to resolve this cardinality_threshold error?
I tried to use ggpairs to visualise my dataset, but I don't understand the error message that I am getting. Can someone please help me?

> describe(Mydata)
             vars      n     mean       sd   median  trimmed      mad min       max     range  skew kurtosis     se
Time            1 192008  4257.07  2589.28  4156.44  4210.33  3507.03   0   8869.91   8869.91  0.09    -1.28   5.91
Source*         2 192008     9.32     5.95     8.00     8.53     2.97   1     51.00     50.00  3.39    15.94   0.01
Destination*    3 192008     8.22     6.49     7.00     7.31     2.97   1     51.00     50.00  3.07    13.21   0.01
Protocol*       4 192008    16.14     4.29    19.00    16.77     0.00   1     20.00     19.00 -1.26     0.66   0.01
Length          5 192008   166.12   464.07    74.00    96.25    11.86  60  21786.00  21726.00 14.40   349.17   1.06
Info*           6 192008 63731.70 46463.90 60732.50 62899.62 69904.59   1 131625.00 131624.00  0.14    -1.47 106.04

> Mydata[,1][Mydata[,1] == 0] <- NA
> ggpairs(Mydata)
Error in stop_if_high_cardinality(data, columns, cardinality_threshold) :
  Column 'Source' has more levels (51) than the threshold (15) allowed.
  Please remove the column or increase the 'cardinality_threshold' parameter.
  Increasing the cardinality_threshold may produce long processing times
As the error suggests, the way to get rid of it is to set cardinality_threshold=NULL or cardinality_threshold=51, since Source and Destination are both factor variables with 51 levels. However, it's likely to be hard to see any detail in the plots, if they plot at all, because one of the panels would be attempting to fit 51 barplots with 51 columns each into it. You may want to think about whether grouping your factor levels makes sense for the analysis you're interested in, or exclude the factors altogether (although that only leaves two continuous variables).
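For example, a minimal sketch (assuming Mydata is the data frame passed to ggpairs above):

library(GGally)
# disable the cardinality check entirely...
ggpairs(Mydata, cardinality_threshold = NULL)
# ...or raise the threshold just enough for the 51-level factors
ggpairs(Mydata, cardinality_threshold = 51)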
Ramp up/down missing time-series data in R
I have a set of time-series data (GPS speed data, specifically) which includes gaps of missing values where the signal was lost. For missing periods of short duration I am about to fill the gaps simply using na.spline, however this is inappropriate for longer periods. I would like to ramp the values from the last true value down to zero, based on predefined acceleration limits.

# create sample data frame
test <- as.data.frame(c(6,5.7,5.4,5.14,4.89,4.64,4.41,4.19,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,NA,5,5.1,5.3,5.4,5.5))
names(test)[1] <- "speed"

# set rate of acceleration for ramp
ramp <- 6

# set sampling rate of receiver
Hz <- 1/10

So for missing data the ramp would use the previous value and the rate of acceleration to get the next data point, until speed reaches zero (i.e. last speed [4.19] - (Hz * ramp)), yielding the following values:

3.59 2.99 2.39 1.79 1.19 0.59 0

Lastly, I need to do this in the reverse fashion, to ramp up from zero when the signal picks back up again. Hope this is clear. Cheers
It's not really elegant, but you can do it in a loop.

na.pos <- which(is.na(test$speed))
acc <- FALSE
for (i in na.pos) {
  if (acc) {
    speed <- test$speed[i-1] + (Hz*ramp)
  } else {
    speed <- test$speed[i-1] - (Hz*ramp)
    if (round(speed, 1) < 0) {
      acc <- TRUE
      speed <- test$speed[i-1] + (Hz*ramp)
    }
  }
  test[i,] <- speed
}

The result is:

   speed
1   6.00
2   5.70
3   5.40
4   5.14
5   4.89
6   4.64
7   4.41
8   4.19
9   3.59
10  2.99
11  2.39
12  1.79
13  1.19
14  0.59
15 -0.01
16  0.59
17  1.19
18  1.79
19  2.39
20  2.99
21  3.59
22  4.19
23  4.79
24  5.00
25  5.10
26  5.30
27  5.40
28  5.50

Note the '-0.01': 0.59 - (6 * 1/10) is -0.01, not 0. You can round it later; I decided not to.
When the question says "ramp the values from the last true value down to zero" in each run of NAs, I assume that means any remaining NAs in the run after reaching zero are also to be replaced by zero.

First, use rleid from data.table to create a grouping vector the same length as test$speed identifying each run in is.na(test$speed), and use ave to create sequence numbers within those groups, seqno. Then calculate the declining sequences, ramp_down, by combining na.locf(test$speed) and seqno. Finally, replace the NAs:

library(data.table)
library(zoo)

test_speed <- test$speed
seqno <- ave(test_speed, rleid(is.na(test_speed)), FUN = seq_along)
ramp_down <- pmax(na.locf(test_speed) - seqno * ramp * Hz, 0)
result <- ifelse(is.na(test_speed), ramp_down, test_speed)

giving:

> result
 [1] 6.00 5.70 5.40 5.14 4.89 4.64 4.41 4.19 3.59 2.99 2.39 1.79 1.19 0.59 0.00
[16] 0.00 0.00 0.00 0.00 0.00 0.00 0.00 0.00 5.00 5.10 5.30 5.40 5.50
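The question also asks for the reverse ramp up from zero when the signal returns. One possible sketch along the same lines numbers each run from its far end and keeps the larger of the two ramps; it assumes you want to ramp back from the next observed value, mirroring the ramp down:

# sequence numbers counting backwards within each run
seqno_rev <- ave(test_speed, rleid(is.na(test_speed)), FUN = function(x) rev(seq_along(x)))
# ramp up towards the next non-NA value (assumption about the desired behaviour)
ramp_up <- pmax(na.locf(test_speed, fromLast = TRUE) - seqno_rev * ramp * Hz, 0)
# fill each NA with whichever ramp is higher, giving a V-shaped profile
result <- ifelse(is.na(test_speed), pmax(ramp_down, ramp_up), test_speed)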
Find where species accumulation curve reaches asymptote
I have used the specaccum() command to develop species accumulation curves for my samples. Here is some example data:

site1<-c(0,8,9,7,0,0,0,8,0,7,8,0)
site2<-c(5,0,9,0,5,0,0,0,0,0,0,0)
site3<-c(5,0,9,0,0,0,0,0,0,6,0,0)
site4<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site5<-c(5,0,9,0,0,6,6,0,0,0,0,0)
site6<-c(5,0,9,0,0,0,6,6,0,0,0,0)
site7<-c(5,0,9,0,0,0,0,0,7,0,0,3)
site8<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site9<-c(5,0,9,0,0,0,0,0,0,0,1,0)
site10<-c(5,0,9,0,0,0,0,0,0,0,1,6)
site11<-c(5,0,9,0,0,0,5,0,0,0,0,0)
site12<-c(5,0,9,0,0,0,0,0,0,0,0,0)
site13<-c(5,1,9,0,0,0,0,0,0,0,0,0)

species_counts <- rbind(site1,site2,site3,site4,site5,site6,site7,site8,site9,site10,site11,site12,site13)

accum <- specaccum(species_counts, method="random", permutations=100)
plot(accum)

In order to ensure I have sampled sufficiently, I need to make sure the curve of the species accumulation plot reaches an asymptote, defined as a slope of <0.3 between the last two points (i.e. between sites 12 and 13).

results <- with(accum, data.frame(sites, richness, sd))

Produces this:

   sites richness        sd
1      1     3.46 0.9991916
2      2     4.94 1.6625403
3      3     5.94 1.7513054
4      4     7.05 1.6779918
5      5     8.03 1.6542263
6      6     8.74 1.6794660
7      7     9.32 1.5497149
8      8     9.92 1.3534841
9      9    10.51 1.0492422
10    10    11.00 0.8408750
11    11    11.35 0.7017295
12    12    11.67 0.4725816
13    13    12.00 0.0000000

I feel like I'm getting there. I could generate an lm with site vs richness and extract the exact slope (tangent?) between sites 12 and 13. Going to search a bit longer here.
Streamlining your data generation process a little bit:

species_counts <- matrix(c(0,8,9,7,0,0,0,8,0,7,8,0,
                           5,0,9,0,5,0,0,0,0,0,0,0,
                           5,0,9,0,0,0,0,0,0,6,0,0,
                           5,0,9,0,0,0,0,0,0,0,0,0,
                           5,0,9,0,0,6,6,0,0,0,0,0,
                           5,0,9,0,0,0,6,6,0,0,0,0,
                           5,0,9,0,0,0,0,0,7,0,0,3,
                           5,0,9,0,0,0,0,0,0,0,1,0,
                           5,0,9,0,0,0,0,0,0,0,1,0,
                           5,0,9,0,0,0,0,0,0,0,1,6,
                           5,0,9,0,0,0,5,0,0,0,0,0,
                           5,0,9,0,0,0,0,0,0,0,0,0,
                           5,1,9,0,0,0,0,0,0,0,0,0),
                         byrow=TRUE, nrow=13)

It's always a good idea to set.seed() before running randomization tests (and to let us know that specaccum is in the vegan package):

set.seed(101)
library(vegan)
accum <- specaccum(species_counts, method="random", permutations=100)

Extract the richness and sites components from within the returned object and compute d(richness)/d(sites). Note that the slope vector is one element shorter than the original sites/richness vectors: be careful if you're trying to match up slopes with particular numbers of sites.

(slopes <- with(accum, diff(richness)/diff(sites)))
##  [1] 1.45 1.07 0.93 0.91 0.86 0.66 0.65 0.45 0.54 0.39 0.32 0.31

In this case the slope never actually goes below 0.3, so this code for finding the first time the slope falls below 0.3 returns NA:

which(slopes < 0.3)[1]
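If you also want the number of sites at which the criterion is first met (when it is), note that slopes[i] is the slope between sites i and i+1, so one possible sketch is:

first_below <- which(slopes < 0.3)[1]
# number of sites needed to reach the asymptote criterion (NA if never reached)
if (!is.na(first_below)) accum$sites[first_below + 1] else NA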
Adding error bars to an existing plot using values from a column
I am having problems trying to add the 95% CI (lower/upper) to an existing plot, using the values from two columns in a matrix. What would be the best way to use this information to add the error bars? Here is a sample of my data:

option <- read.table(text="
  distance p.move id option   mean lower95%CI upper95%CI
1    close   0.05  1    10% 13.682     11.306     15.768
2    close   0.10  2    10% 10.886      9.336     12.270
3    close   0.15  3    10%  8.402      7.262      9.580
4    close   0.20  4    10%  7.240      6.132      8.350
5    close   0.25  5    10%  6.322      5.288      7.370
6    close   0.30  6    10%  5.850      4.920      6.714
7    close   0.35  7    10%  3.838      3.084      4.648
8    close   0.40  8    10%  3.600      2.936      4.200
9    close   0.45  9    10%  3.380      2.702      4.016
10   close   0.50 10    10%  3.152      2.462      3.720
11   close   0.55 11    10%  2.772      2.214      3.286
12   close   0.60 12    10%  3.072      2.458      3.596
13   close   0.65 13    10%  2.670      2.134      3.212
14   close   0.70 14    10%  2.194      1.724      2.634
15   close   0.75 15    10%  1.980      1.612      2.336
16   close   0.80 16    10%  2.028      1.594      2.466
17   close   0.85 17    10%  1.650      1.294      1.974
18   close   0.90 18    10%  1.916      1.564      2.254", header=TRUE)
option

This is my plot:

plot(option$mean ~ option$p.move, xlim=c(0,1), type="o", ylim=c(0,20),
     xlab="Probability", ylab="% time", col=1, lwd=1.85)

Thanks a lot in advance,
You can just add the extra columns in with lines, which is like plot but draws on the existing plot (see ?lines, ?points). Also, when you plot with a data frame you can skip all the option$ prefixes by feeding option into the data argument (see ?plot):

# draw original plot
plot(mean ~ p.move, data=option, xlim=c(0,1), type="o", ylim=c(0,20),
     xlab="Probability", ylab="% time", col=1, lwd=1.85)

# draw extra lines (the '%' in the column names gets converted to '.' by R)
# note you can put your usual `plot` arguments into `lines` (lwd, type, etc.)
# if you want
lines(upper95.CI ~ p.move, option)
lines(lower95.CI ~ p.move, option)
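If you would rather have conventional error bars than separate CI curves, base R's arrows() can draw them; a sketch using the same columns:

# vertical error bars at each point; angle=90 and code=3 put flat caps on both ends
arrows(option$p.move, option$lower95.CI, option$p.move, option$upper95.CI,
       angle = 90, code = 3, length = 0.05)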
sapply? tapply? ddply? dataframe variable based on rolling index of previous values of another variable
I haven't found something which precisely matches what I need, so I thought I'd post this. I have a number of functions which basically rely on a rolling index of a variable, apply a function over that window, and should naturally flow back into the data frame they came from. For example:

data <- as.data.frame(as.matrix(seq(1:30)))
data$V1 <- data$V1/100
str(data)

# rolling 5 day product
data$V1_MA5d <- NA
for (i in 5:nrow(data)){
  start <- i-5
  end <- i
  data$V1_MA5d[i] <- (prod(((data$V1[start:end]/100)+1))-1)*100
}
data

> head(data,15)
     V1   V1_MA5d
1  0.01        NA
2  0.02        NA
3  0.03        NA
4  0.04        NA
5  0.05 0.1500850
6  0.06 0.2101751
7  0.07 0.2702952
8  0.08 0.3304453
9  0.09 0.3906255
10 0.10 0.4508358
11 0.11 0.5110762
12 0.12 0.5713467
13 0.13 0.6316473
14 0.14 0.6919780
15 0.15 0.7523389

But really, I should be able to do something like:

data$V1_MA5d <- sapply(data$V1, function(x) prod(((data$V1[i-5:i]/100)+1))-1)*100

But I'm not sure what that would look like. Likewise, the count of a variable by another variable:

data$V1_MA5_cat <- NA
data$V1_MA5_cat[data$V1_MA5d<.5] <- 0
data$V1_MA5_cat[data$V1_MA5d>.5] <- 1
data$V1_MA5_cat[data$V1_MA5d>1.5] <- 2
table(data$V1_MA5_cat)

data$V1_MA5_cat_n <- NA
data$V1_MA5_cat_n[data$V1_MA5_cat==0] <- nrow(subset(data, V1_MA5_cat==0))
data$V1_MA5_cat_n[data$V1_MA5_cat==1] <- nrow(subset(data, V1_MA5_cat==1))
data$V1_MA5_cat_n[data$V1_MA5_cat==2] <- nrow(subset(data, V1_MA5_cat==2))

> head(data,15)
     V1   V1_MA5d V1_MA5_cat V1_MA5_cat_n
1  0.01        NA         NA           NA
2  0.02        NA         NA           NA
3  0.03        NA         NA           NA
4  0.04        NA         NA           NA
5  0.05 0.1500850          0            6
6  0.06 0.2101751          0            6
7  0.07 0.2702952          0            6
8  0.08 0.3304453          0            6
9  0.09 0.3906255          0            6
10 0.10 0.4508358          0            6
11 0.11 0.5110762          1           17
12 0.12 0.5713467          1           17
13 0.13 0.6316473          1           17
14 0.14 0.6919780          1           17
15 0.15 0.7523389          1           17

I know there is a better way - help!
You can do this one of a few ways. It's worth mentioning here that you did write a "correct" for loop in R: you preallocated the vector by assigning data$V1_MA5d <- NA, so you are filling rather than growing, and it's actually fairly efficient. However, if you want to use the apply family:

sapply(5:nrow(data), function(i) (prod(data$V1[(i-5):i]/100 + 1)-1)*100)
 [1] 0.1500850 0.2101751 0.2702952 0.3304453 0.3906255 0.4508358 0.5110762 0.5713467 0.6316473 0.6919780 0.7523389 0.8127299
[13] 0.8731511 0.9336024 0.9940839 1.0545957 1.1151376 1.1757098 1.2363122 1.2969448 1.3576077 1.4183009 1.4790244 1.5397781
[25] 1.6005622 1.6613766

Notice my code inside the [] is different from yours. Check out the difference:

i <- 10
i - 5:i
(i-5):i

Or you can use rollapply from the zoo package:

library(zoo)
myfun <- function(x) (prod(x/100 + 1)-1)*100
rollapply(data$V1, 5, myfun)
 [1] 0.1500850 0.2001551 0.2502451 0.3003552 0.3504853 0.4006355 0.4508057 0.5009960 0.5512063 0.6014367 0.6516872 0.7019577
[13] 0.7522484 0.8025591 0.8528899 0.9032408 0.9536118 1.0040030 1.0544142 1.1048456 1.1552971 1.2057688 1.2562606 1.3067726
[25] 1.3573047 1.4078569

As per the comment, this gives you a vector of length 26. You can instead add a few arguments to rollapply to make it line up with your initial data:

rollapply(data$V1, 5, myfun, fill=NA, align='right')

In regard to your second question, plyr is handy here:

library(plyr)
data$cuts <- cut(data$V1_MA5d, breaks=c(-Inf, 0.5, 1.5, Inf))
ddply(data, .(cuts), transform, V1_MA5_cat_n=length(cuts))

But there are many other choices too.
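For instance, a sketch of the dplyr equivalent of that ddply call (assuming the cuts column created above):

library(dplyr)
# count rows per cut level and attach the count to every row in that level
data <- data %>%
  group_by(cuts) %>%
  mutate(V1_MA5_cat_n = n()) %>%
  ungroup()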