Extracting anomalous intervals from a data frame

I have a large dataset from which I am trying to extract intervals (from the column Zone) where the Anom value is ≥1 for 5 or more consecutive cells, and to calculate the mean of each interval. In the example below I would like to extract the information that the anomalous intervals are Zones 5 to 11 and 17 to 26, while ignoring 28 to 29 (as the number of consecutive cells is <5). Any help is much appreciated.
df <- data.frame("Zone" = 1:30, "Anom" = 1:30)
df[,2] <- 0
df[5:11,2] <- 1
df[17:26,2] <- 1
df[28:29,2] <- 1
df
Zone Anom
1 1 0
2 2 0
3 3 0
4 4 0
5 5 1
6 6 1
7 7 1
8 8 1
9 9 1
10 10 1
11 11 1
12 12 0
13 13 0
14 14 0
15 15 0
16 16 0
17 17 1
18 18 1
19 19 1
20 20 1
21 21 1
22 22 1
23 23 1
24 24 1
25 25 1
26 26 1
27 27 0
28 28 1
29 29 1
30 30 0
The sort of output I would like to generate:
Zone.From Zone.To Anom.Mean
1 5 11 1
2 17 26 1

One way, using dplyr and data.table's rleid, is to create a new group for each change in Anom. For each group, get the first and last value of Zone, the mean of Anom, the number of rows and the first value of Anom. We can then filter to keep only those groups that have at least 5 rows and where Anom is greater than 0.
library(dplyr)
df %>%
  group_by(grp = data.table::rleid(Anom)) %>%
  summarise(Zone.From = first(Zone),
            Zone.To = last(Zone),
            mean_anom = mean(Anom),
            N = n(),
            Anom = first(Anom)) %>%
  filter(Anom > 0 & N >= 5) %>%
  select(-c(grp, N, Anom))
# Zone.From Zone.To mean_anom
# <int> <int> <dbl>
#1 5 11 1
#2 17 26 1
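For readers who prefer to avoid extra packages, the same run-based idea can be sketched in base R with rle(). This is only an illustration under the assumption that "anomalous" means Anom > 0; the names runs, starts, ends, keep and result are introduced here and are not part of the original answer.

```r
# Rebuild the example data from the question
df <- data.frame(Zone = 1:30, Anom = 0)
df$Anom[c(5:11, 17:26, 28:29)] <- 1

# Run-length encode the anomaly flag (assumption: anomalous means Anom > 0)
runs <- rle(df$Anom > 0)
ends <- cumsum(runs$lengths)        # last row index of each run
starts <- ends - runs$lengths + 1   # first row index of each run

# Keep anomalous runs that span at least 5 rows
keep <- runs$values & runs$lengths >= 5

result <- data.frame(
  Zone.From = df$Zone[starts[keep]],
  Zone.To   = df$Zone[ends[keep]],
  Anom.Mean = mapply(function(s, e) mean(df$Anom[s:e]),
                     starts[keep], ends[keep])
)
result
```

Note that this groups on the logical flag Anom > 0 rather than on identical Anom values, so a run whose anomalous values vary would still count as one interval.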

Related

Drop rows in a data frame that are in-between two integer values in R

I have this data frame coming out of certain participant's behaviour in an episodic task, and let's say the episode starts at 90 and finishes when we have a certain trigger that can be in the range of 40s. I am doing a sample dataframe with a column with the number of the rows and the other with the actual triggers.
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
df <- data.frame(ex1,ex2)
> df
ex1 ex2
1 1 41
2 2 1
3 3 1
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
9 9 1
10 10 90
11 11 1
12 12 2
13 13 42
14 14 1
15 15 1
16 16 1
17 17 1
18 18 90
19 19 1
20 20 41
Now, what I am trying to do is remove all the rows that are outside the beginning and the end of the episode, as they are recordings of typed behaviour that is not interesting as it falls outside of the episode. Therefore, I want to end up with a dataframe like this:
ex1 <- c(1,4,5,6,7,8,10,11,12,13,18,19,20)
ex2 <- c(41,90,1,1,1,44,90,1,2,42,90,1,41)
df <- data.frame(ex1,ex2)
> df
ex1 ex2
1 1 41
2 4 90
3 5 1
4 6 1
5 7 1
6 8 44
7 10 90
8 11 1
9 12 2
10 13 42
11 18 90
12 19 1
13 20 41
I have been trying to use subset but I cannot make it work between a range and a number.
Thanks in advance!

Setting the values:
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
before <- data.frame(ex1,ex2)
before
ex1 ex2
1 1 41
2 2 1
3 3 1
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
9 9 1
10 10 90
11 11 1
12 12 2
13 13 42
14 14 1
15 15 1
16 16 1
17 17 1
18 18 90
19 19 1
20 20 41
I have built a function that should do the work.
It is based on my understanding of your problem, so it may not fit your setting perfectly.
However, I believe you can adapt it to your needs with a few small adjustments.
library(dplyr)
episode <- function(start = 90, end = 40, data){ # the default value of start is 90 and the default value of end is 40
  # retrieving all the row indices whose values indicate an end (end to end + 10)
  end_idx <- which(data$ex2 >= end & data$ex2 <= end + 10)
  # retrieving all the row indices whose values indicate a start
  start_idx <- which(data$ex2 == start)
  # declaring a list that will contain the extracted sub-samples
  sub_sample_list <- vector("list", length(start_idx))
  # looping through the start indices (seq_along is safe when there are none)
  for(i in seq_along(start_idx)){
    # the smallest end index that is larger than the i-th start index
    temp_end <- min(end_idx[end_idx > start_idx[i]])
    # extracting the rows between the i-th start index and that end index
    temp_sub_sample <- data[start_idx[i]:temp_end, ]
    # saving the sub-sample in the list
    sub_sample_list[[i]] <- temp_sub_sample
  }
  # row binding all the extracted sub-samples
  clean.df <- do.call(rbind.data.frame, sub_sample_list)
  # if there is an end index that is smaller than the minimum start index
  if(min(end_idx) < min(start_idx)){
    # retrieve only those rows and add them to clean.df
    clean.df <- rbind(data[end_idx[end_idx < min(start_idx)], ], clean.df)
  }
  # cleaning up the row numbers a bit
  rownames(clean.df) <- 1:nrow(clean.df)
  # sort clean.df by ex1
  clean.df <- clean.df %>% arrange(ex1)
  # returning clean.df
  return(clean.df)
}
Generating the after data set by using the episode function.
after <- episode(start = 90, end = 40, before)
after
ex1 ex2
1 1 41
2 4 90
3 5 1
4 6 1
5 7 1
6 8 44
7 10 90
8 11 1
9 12 2
10 13 42
11 18 90
12 19 1
13 20 41

And in base R:
ex1 <- c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
ex2 <- c(41,1,1,90,1,1,1,44,1,90,1,2,42,1,1,1,1,90,1,41)
df <- data.frame(ex1,ex2)
Index the starts of the series (90); if the first start is not in row 1, drop the rows before it as an incomplete episode:
start_idx <- which(df$ex2 == 90)
df <- df[start_idx[1]:nrow(df), ]
Re-index the starts, and index the ends (values >= 40 and < 90):
start_idx <- which(df$ex2 == 90)
end_idx <- which(df$ex2 >= 40 & df$ex2 < 90)
Make an empty list and loop through it, subsetting out the start:end sections:
df_lst <- list()
for (k in seq_along(start_idx)) {
  df_lst[[k]] <- df[start_idx[k]:end_idx[k], ]
}
Bring them all together:
df2 <- do.call('rbind', df_lst)
df2
ex1 ex2
4 4 90
5 5 1
6 6 1
7 7 1
8 8 44
10 10 90
11 11 1
12 12 2
13 13 42
18 18 90
19 19 1
20 20 41
Fairly compact.
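For comparison, the same filtering can be sketched as a single pass that carries an "inside an episode" flag, which avoids having to pair start and end indices by position. The helper name keep_episodes and its arguments are hypothetical, and the end range of 40 to 50 is an assumption borrowed from the first answer (end to end + 10).

```r
# Hypothetical helper: flag rows that lie between a start trigger and the
# next end trigger (both inclusive). The end range 40..50 is an assumption.
keep_episodes <- function(x, start = 90, end_lo = 40, end_hi = 50) {
  inside <- FALSE
  keep <- logical(length(x))
  for (i in seq_along(x)) {
    if (x[i] == start) inside <- TRUE   # an episode opens on the start trigger
    keep[i] <- inside
    # the episode closes after the first end trigger seen while inside
    if (inside && x[i] >= end_lo && x[i] <= end_hi) inside <- FALSE
  }
  keep
}

ex1 <- 1:20
ex2 <- c(41, 1, 1, 90, 1, 1, 1, 44, 1, 90, 1, 2, 42, 1, 1, 1, 1, 90, 1, 41)
before <- data.frame(ex1, ex2)

after <- before[keep_episodes(before$ex2), ]
after
```

Like the base answer above, this sketch drops the stray end trigger in row 1, because no episode has started at that point.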

Using intervals in a column to populate values for another column

I have a dataframe:
dataframe <- data.frame(Condition = rep(c(1,2,3), each = 5, times = 2),
Time = sort(sample(1:60, 30)))
Condition Time
1 1 1
2 1 3
3 1 4
4 1 7
5 1 9
6 2 11
7 2 12
8 2 14
9 2 16
10 2 18
11 3 19
12 3 24
13 3 25
14 3 28
15 3 30
16 1 31
17 1 34
18 1 35
19 1 38
20 1 39
21 2 40
22 2 42
23 2 44
24 2 47
25 2 48
26 3 49
27 3 54
28 3 55
29 3 57
30 3 59
I want to divide the total length of Time (i.e., max(Time) - min(Time)) per Condition by a constant 'x' (e.g., 3). Then I want to use that quotient to add a new variable Trial such that my dataframe looks like this:
Condition Time Trial
1 1 1 A
2 1 3 A
3 1 4 B
4 1 7 C
5 1 9 C
6 2 11 A
7 2 12 A
8 2 14 B
9 2 16 C
10 2 18 C
... and so on
As you can see, for Condition 1, Trial is populated with unique identifying values (e.g., A, B, C) every 2.67 seconds = 8 (total time) / 3. For Condition 2, Trial is populated every 2.33 seconds = 7 (total time) /3.
I am not getting what I want with my current code:
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, 3, labels = F)])
# Groups: Condition [3]
Condition Time Trial
<dbl> <int> <chr>
1 1 1 A
2 1 3 A
3 1 4 A
4 1 7 A
5 1 9 A
6 2 11 A
7 2 12 A
8 2 14 A
9 2 16 A
10 2 18 A
# ... with 20 more rows
Thanks!

We can take the difference of the range (range() returns the min and max as a vector), divide it by the constant (here 3), and pass the result as the breaks argument of cut. Then use the integer index (labels = FALSE) to pick the corresponding letter from R's built-in LETTERS constant.
library(dplyr)
dataframe %>%
group_by(Condition) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
If the grouping should be based on adjacent values in 'Condition', use rleid from data.table on the 'Condition' column to create the grouping, and apply the same code as above
library(data.table)
dataframe %>%
group_by(grp = rleid(Condition)) %>%
mutate(Trial = LETTERS[cut(Time, diff(range(Time))/3,
labels = FALSE)])
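The desired output in the question looks like three equal-width bins within each consecutive run of Condition. A minimal base R sketch of that reading passes 3 directly as the number of intervals to cut and groups by run rather than by Condition value. The Time values below are copied from the sample printed in the question, and the variable name run is introduced here for illustration.

```r
# Sample data as printed in the question
dataframe <- data.frame(
  Condition = rep(c(1, 2, 3), each = 5, times = 2),
  Time = c(1, 3, 4, 7, 9, 11, 12, 14, 16, 18, 19, 24, 25, 28, 30,
           31, 34, 35, 38, 39, 40, 42, 44, 47, 48, 49, 54, 55, 57, 59)
)

# Identify consecutive runs of Condition (a base R stand-in for rleid)
run <- cumsum(c(TRUE, diff(dataframe$Condition) != 0))

# Within each run, split Time into 3 equal-width bins and keep the bin index
bin <- ave(dataframe$Time, run,
           FUN = function(t) cut(t, 3, labels = FALSE))

dataframe$Trial <- LETTERS[bin]
head(dataframe, 10)
```

For the first two runs this gives A A B C C, matching the table the question asks for.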

Here's a one-liner using my santoku package. The rleid line is the same as in @akrun's solution.
library(dplyr)
library(magrittr) # provides the %<>% assignment pipe
library(santoku)
dataframe %<>%
group_by(grp = data.table::rleid(Condition)) %>%
mutate(
Trial = chop_evenly(Time, intervals = 3, labels = lbl_seq("A"))
)

How can I create a new column with the same id every n rows in R?

I have a data frame where I want to create a new column in which to assign the same ID every 30 rows.
My data frame is from an experiment and I wish to create a new "bloc" column, so that every 30 rows it increments by 1
example:
col1 : response latency = 1.0002, 1.2566, ...30times, 1.5422, ...
col2 : difficulty = easy, hard, intermediate, ...
col3 : ID = 1, 2, 3, ...30times, 31, 32, ...
And I want a new column
new col : bloc = 1, 1, ...30times, 2, 2, ...30times, 3, 3, ...

Using 5 as an example, but this of course works the same for 30
df <- data.frame(rownum = 1:23)
bloc_len <- 5
df$bloc <-
rep(seq(1, 1 + nrow(df) %/% bloc_len), each = bloc_len, length.out = nrow(df))
df
# rownum bloc
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 6 2
# 7 7 2
# 8 8 2
# 9 9 2
# 10 10 2
# 11 11 3
# 12 12 3
# 13 13 3
# 14 14 3
# 15 15 3
# 16 16 4
# 17 17 4
# 18 18 4
# 19 19 4
# 20 20 4
# 21 21 5
# 22 22 5
# 23 23 5
You could also use %/% (same output)
df$bloc <-
1 + seq(0, nrow(df) - 1) %/% bloc_len

You can use the rep(x, times) function to create the bloc column you want.
See the example below.
set.seed(12345)
Create a random data set
data <- data.frame(
response_latency = abs(rnorm(90, 2, 1)),
difficulty = sample(c("easy", "hard", "intermediate"), 90, replace = TRUE),
ID = 1:90
)
Here, to add the bloc column in your dataset, you can use the following code:
bloc <- c(rep(x = 1, times = 30), rep(x = 2, times = 30), rep(x = 3, times = 30))
data$bloc <- bloc
head(data,n=35)
The new dataset will be as follows:
response_latency difficulty ID bloc
1 1.8890497 intermediate 1 1
2 2.9996586 intermediate 2 1
3 3.0255886 hard 3 1
4 0.3949156 hard 4 1
5 2.0027199 easy 5 1
6 2.9580737 hard 6 1
7 1.3337903 intermediate 7 1
8 1.4844084 hard 8 1
9 1.3941750 hard 9 1
10 1.6923244 intermediate 10 1
11 1.8186642 easy 11 1
12 0.9167691 easy 12 1
13 2.5987185 easy 13 1
14 1.8345693 intermediate 14 1
15 0.9177725 hard 15 1
16 2.3445309 easy 16 1
17 2.5187724 hard 17 1
18 1.2220053 hard 18 1
19 2.1636086 hard 19 1
20 0.7847963 hard 20 1
21 1.3785363 hard 21 1
22 2.9451529 intermediate 22 1
23 2.3722482 intermediate 23 1
24 2.1812877 intermediate 24 1
25 0.1383615 easy 25 1
26 1.3996498 easy 26 1
27 3.7593749 hard 27 1
28 2.0056114 hard 28 1
29 3.2195714 hard 29 1
30 2.1481248 easy 30 1
31 3.2546741 intermediate 31 2
32 2.4221608 hard 32 2
33 2.0465687 intermediate 33 2
34 1.7649423 easy 34 2
35 1.7338255 hard 35 2
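If the number of rows is not known in advance, hard-coding three rep() calls becomes awkward. One possible sketch derives the block number from the row position instead. The data frame below is a hypothetical stand-in with only an ID column, and bloc_len is a name introduced here.

```r
# Hypothetical stand-in for the experiment data: 90 trials
data <- data.frame(ID = 1:90)

bloc_len <- 30  # rows per block

# Row position divided by block length, rounded up, gives the block number
data$bloc <- ceiling(seq_len(nrow(data)) / bloc_len)

table(data$bloc)
```

This gives the same blocks as the hard-coded version, and a final incomplete block would simply receive the next number.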

Subsetting data, finding MAX, MIN, Mean and plotting it

I have data as follows:
id expressions mode
1 22 0
2 24 0
3 23 0
4 5 1
5 56 1
6 42 1
7 32 0
8 21 0
9 11 1
10 72 1
I would like to get a new table like this (following on from a previously asked question):
id max mean min mode
1 24 23 22 0
2 56 34 5 1
3 32 26 21 0
4 72 41 11 1
So basically I want a rolling-apply function with a variable window, where a new window starts each time mode toggles, as shown in the output above.

We can use data.table to do a group by operation based on the run-length-id of 'mode' and get the max/mean/min/mode of 'expressions'
library(data.table)
setDT(df1)[, .(Max = max(expressions), Mean = round(mean(expressions)),
Min = min(expressions)), .(id = rleid(mode), mode)]
# id mode Max Mean Min
#1: 1 0 24 23 22
#2: 2 1 56 34 5
#3: 3 0 32 26 21
#4: 4 1 72 42 11
Or with tidyverse
library(dplyr)
df1 %>%
group_by(id = cumsum(c(TRUE, diff(mode) != 0)), mode) %>%
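  # funs() is deprecated in recent dplyr versions, so the summaries are spelled out
  summarise(max = max(expressions), mean = round(mean(expressions)), min = min(expressions))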
# id mode max mean min
# <int> <int> <dbl> <dbl> <dbl>
#1 1 0 24 23 22
#2 2 1 56 34 5
#3 3 0 32 26 21
#4 4 1 72 42 11
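The run-based grouping can also be sketched without any packages. The example below rebuilds the question's data as df1 (the original answers assume it already exists) and uses cumsum() on the mode changes as a stand-in for rleid. The names run and res are introduced here, and the mean is left unrounded.

```r
# Data from the question
df1 <- data.frame(
  id = 1:10,
  expressions = c(22, 24, 23, 5, 56, 42, 32, 21, 11, 72),
  mode = c(0, 0, 0, 1, 1, 1, 0, 0, 1, 1)
)

# A new run starts whenever mode differs from the previous row
run <- cumsum(c(TRUE, diff(df1$mode) != 0))

# Summarise expressions within each run
res <- data.frame(
  id   = unique(run),
  max  = as.vector(tapply(df1$expressions, run, max)),
  mean = as.vector(tapply(df1$expressions, run, mean)),
  min  = as.vector(tapply(df1$expressions, run, min)),
  mode = as.vector(tapply(df1$mode, run, function(m) m[1]))
)
res
```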

Combining 2 columns into 1 column many times in a very large dataset in R

The clumsy solutions I am working on are not going to be very fast, even if I can get them to work, and the true dataset is ~1500 x 45000, so they need to be fast. I am definitely at a loss for 1) at this point, although I have some code for 2) and 3).
Here is a toy example of the data structure:
n <- 10 # number of individuals; must match the length of the SNP vectors below
pop = data.frame(status = rbinom(n, 1, .42), sex = rbinom(n, 1, .5),
age = round(rnorm(n, mean=40, 10)), disType = rbinom(n, 1, .2),
rs123=c(1,3,1,3,3,1,1,1,3,1), rs123.1=rep(1, n), rs157=c(2,4,2,2,2,4,4,4,2,2),
rs157.1=c(4,4,4,2,4,4,4,4,2,2), rs132=c(4,4,4,4,4,4,4,4,2,2),
rs132.1=c(4,4,4,4,4,4,4,4,4,4))
Thus, there are a few columns of basic demographic info and then the rest of the columns are biallelic SNP info. Ex: rs123 is allele 1 of rs123 and rs123.1 is the second allele of rs123.
1) I need to merge all the biallelic SNP data that is currently in 2 columns into 1 column, so, for example: rs123 and rs123.1 into one column (but within the dataset):
11
31
11
31
31
11
11
11
31
11
2) I need to identify the least frequent SNP value (in the above example it is 31).
3) I need to replace the least frequent SNP value with 1 and the other(s) with 0.

Do you mean 'merge' or 'rearrange' or simply concatenate? If it is the latter then
R> pop2 <- data.frame(pop[,1:4], rs123=paste(pop[,5],pop[,6],sep=""),
+ rs157=paste(pop[,7],pop[,8],sep=""),
+ rs132=paste(pop[,9],pop[,10], sep=""))
R> pop2
status sex age disType rs123 rs157 rs132
1 0 0 42 0 11 24 44
2 1 1 37 0 31 44 44
3 1 0 38 0 11 24 44
4 0 1 45 0 31 22 44
5 1 1 25 0 31 24 44
6 0 1 31 0 11 44 44
7 1 0 43 0 11 44 44
8 0 0 41 0 11 44 44
9 1 1 57 0 31 22 24
10 1 1 40 0 11 22 24
and now you can do counts and whatnot on pop2:
R> sapply(pop2[,5:7], table)
$rs123
11 31
6 4
$rs157
22 24 44
3 3 4
$rs132
24 44
2 8
R>
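Steps 2) and 3) of the question can be sketched along the same lines. The example below pairs each allele column with its ".1" partner by name, pastes them together, and recodes the least frequent genotype as 1 and everything else as 0. The names geno, combined, recode_rare and coded are hypothetical, only two SNPs from the toy data are used, and ties for "least frequent" are broken by taking the first genotype in table() order, which is an assumption the question does not settle.

```r
# Two SNPs from the toy example, each stored as a pair of allele columns
geno <- data.frame(
  rs123   = c(1, 3, 1, 3, 3, 1, 1, 1, 3, 1),
  rs123.1 = rep(1, 10),
  rs157   = c(2, 4, 2, 2, 2, 4, 4, 4, 2, 2),
  rs157.1 = c(4, 4, 4, 2, 4, 4, 4, 4, 2, 2)
)

# Pair each second-allele column (suffix ".1") with its first-allele column
second <- grep("\\.1$", names(geno), value = TRUE)
first  <- sub("\\.1$", "", second)

# 1) Concatenate the two alleles of every SNP into one genotype column
combined <- as.data.frame(
  Map(function(a, b) paste0(geno[[a]], geno[[b]]), first, second),
  stringsAsFactors = FALSE
)

# 2) and 3) Find the least frequent genotype and recode it as 1, others as 0
recode_rare <- function(g) {
  counts <- table(g)
  rare <- names(counts)[which.min(counts)]  # first minimum wins on ties
  as.integer(g == rare)
}
coded <- as.data.frame(lapply(combined, recode_rare))
coded
```

For a dataset of the size mentioned in the question, this has not been benchmarked, so treat it as a starting point rather than a tuned solution.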
