I work with data frames of flight movements (~1 million rows * 108 variables) and want to group phases during which a certain criterion is met (i.e. the value of a certain variable relative to a threshold). In order to identify these groups, I want to number them.
Being an R newbie, I made it work for my case. Now I am looking for a more elegant way. In particular, I would like to get rid of the "useless" gaps in the numbering of the groups.
I provide a simplified example of my dplyr data frame with the value THR for the threshold criterion. The rows are sorted by timestamp (and thus I can omit it here).
THR <- c(13,17,19,22,21,19,17,12,12,17,20,20,20,17,17,13, 20,20,17,13)
df <- as.data.frame(THR)
df <- tbl_df(df)
To flag all rows where the criterion is (not) met:
df <- mutate(df, CRIT = THR < 19)
With the following, I managed to conditionally "cumsum" to get a unique group identification:
df <- mutate(df, GRP = ifelse(CRIT == 1, 0, cumsum(CRIT)))
df
THR CRIT GRP
1 13 TRUE 0
2 17 TRUE 0
3 19 FALSE 2
4 22 FALSE 2
5 21 FALSE 2
6 19 FALSE 2
7 17 TRUE 0
8 12 TRUE 0
9 12 TRUE 0
10 17 TRUE 0
11 20 FALSE 6
12 20 FALSE 6
While this does the trick and I can operate on the groups with group_by (e.g. summarise, filter), the numbering is not ideal, as can be seen in the example output: the 1st group is numbered 2 and the 2nd group is numbered 6, which is in line with the cumsum() result.
I would appreciate it if anybody could shed some light on this. I was not able to find an appropriate solution in other posts.
I don't think you can really avoid that preliminary step of creating CRIT, though I'd suggest adding cumsum when creating it and then just running a simple cumsum/diff wrap-up on it. Also, if you don't need the groups that don't meet the criterion, it is better to assign NA instead of some arbitrary number such as zero. Here's a possible data.table wrap-up (also, you don't need the df <- tbl_df(df) step at all):
library(data.table)
setDT(df)[, CRIT := cumsum(THR < 19)]
df[THR >= 19, GRP := cumsum(c(0L, diff(CRIT)) != 0L) + 1L]
# THR CRIT GRP
# 1: 13 1 NA
# 2: 17 2 NA
# 3: 19 2 1
# 4: 22 2 1
# 5: 21 2 1
# 6: 19 2 1
# 7: 17 3 NA
# 8: 12 4 NA
# 9: 12 5 NA
# 10: 17 6 NA
# 11: 20 6 2
# 12: 20 6 2
# 13: 20 6 2
# 14: 17 7 NA
# 15: 17 8 NA
# 16: 13 9 NA
# 17: 20 9 3
# 18: 20 9 3
# 19: 17 10 NA
# 20: 13 11 NA
You can do:
x = rle(df$CRIT)
mask = x$values
x$values[mask] = 0                          # runs meeting the criterion -> group 0
x$values[!mask] = cumsum(!x$values[!mask])  # number the remaining runs 1, 2, 3, ...
mutate(df, GRP = inverse.rle(x))
# THR CRIT GRP
#1 13 TRUE 0
#2 17 TRUE 0
#3 19 FALSE 1
#4 22 FALSE 1
#5 21 FALSE 1
#6 19 FALSE 1
#7 17 TRUE 0
#8 12 TRUE 0
#9 12 TRUE 0
#10 17 TRUE 0
#11 20 FALSE 2
#12 20 FALSE 2
#13 20 FALSE 2
#14 17 TRUE 0
#15 17 TRUE 0
#16 13 TRUE 0
#17 20 FALSE 3
#18 20 FALSE 3
#19 17 TRUE 0
#20 13 TRUE 0
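For completeness, since the question works in dplyr, here is a rough dplyr sketch of the same run-numbering idea (my addition; run and GRP are illustrative names, and rows meeting the criterion get NA rather than 0):
library(dplyr)
df %>%
  mutate(CRIT = THR < 19,
         run  = cumsum(CRIT != lag(CRIT, default = first(CRIT))),  # id of each consecutive run
         GRP  = dense_rank(ifelse(CRIT, NA, run)))                 # renumber the FALSE runs; NA elsewhere
dense_rank() ignores the NAs, so the surviving run ids collapse to a gap-free 1, 2, 3 numbering.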
Related
I'd like to assign a sequence to a column of TRUE or FALSE values, with the sequence holding the same value for a TRUE that follows one or many consecutive FALSE values.
i.e. what I'm trying to replicate is this:
df <- data.frame(
new_activity = as.character(c("TRUE","TRUE","TRUE","FALSE","TRUE","TRUE","TRUE","TRUE","TRUE","TRUE",
"TRUE","TRUE","TRUE","TRUE","FALSE","FALSE","TRUE","TRUE","FALSE","TRUE")),
sequence = as.numeric(c(1,2,3,4,4,5,6,7,8,9,10,11,12,13,14,14,14,15,16,16)))
Convert your column to type logical, use cumsum first and then add the negation of the input vector to get the desired output:
x <- as.logical(df$new_activity)
cumsum(x) + !x
# [1] 1 2 3 4 4 5 6 7 8 9 10 11 12 13 14 14 14 15 16 16
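As a quick sanity check against the sequence column given in the question:
all((cumsum(x) + !x) == df$sequence)
# [1] TRUE
(The extra parentheses matter: ! has lower precedence than == in R.)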
One option would be to convert 'new_activity' to logical, use it in i, create column 'sequence1' as the sequence of rows, then fill the NAs backwards with na.locf (fromLast = TRUE):
library(zoo)
library(data.table)
setDT(df)[as.logical(new_activity), sequence1 := seq_len(.N)
][, sequence1 := na.locf(sequence1, fromLast = TRUE)]
df
# new_activity sequence sequence1
# 1: TRUE 1 1
# 2: TRUE 2 2
# 3: TRUE 3 3
# 4: FALSE 4 4
# 5: TRUE 4 4
# 6: TRUE 5 5
# 7: TRUE 6 6
# 8: TRUE 7 7
# 9: TRUE 8 8
#10: TRUE 9 9
#11: TRUE 10 10
#12: TRUE 11 11
#13: TRUE 12 12
#14: TRUE 13 13
#15: FALSE 14 14
#16: FALSE 14 14
#17: TRUE 14 14
#18: TRUE 15 15
#19: FALSE 16 16
#20: TRUE 16 16
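A tidyverse variant of the same fill-backwards idea (a sketch of mine, assuming dplyr and tidyr are available; sequence1 is again an illustrative name):
library(dplyr)
library(tidyr)
df %>%
  mutate(x = as.logical(new_activity),
         sequence1 = ifelse(x, cumsum(x), NA)) %>%  # number only the TRUE rows
  fill(sequence1, .direction = "up")                # FALSE rows take the next TRUE's number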
I would like to create groups from a dataset by matching values.
I have the following data table:
now <- c(1, 2, 3, 4, 24, 25, 26, 5, 6, 21, 22, 23)
before <- c(0, 1, 2, 3, 23, 24, 25, 4, 5, 0, 21, 22)
after <- c(2, 3, 4, 5, 25, 26, 0, 6, 0, 22, 23, 24)
df <- as.data.frame(cbind(now, before, after))
which reproduces the following data:
now before after
1 1 0 2
2 2 1 3
3 3 2 4
4 4 3 5
5 24 23 25
6 25 24 26
7 26 25 0
8 5 4 6
9 6 5 0
10 21 0 22
11 22 21 23
12 23 22 24
I would like to get:
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
5 5 4 6 A
6 6 5 0 A
7 21 0 22 B
8 22 21 23 B
9 23 22 24 B
10 24 23 25 B
11 25 24 26 B
12 26 25 0 B
I would like to reach this answer without using a "for" loop because the real data is too large.
Any help you could provide will be appreciated.
Here is one way. It is hard to avoid a for-loop as this is quite a tricky algorithm. The objection to them is often on the grounds of elegance rather than speed, but sometimes they are entirely appropriate.
df$group <- seq_len(nrow(df)) #assign each row to its own group
stop <- FALSE #indicates convergence
while(!stop){
pre <- df$group #group column at start of loop
for(i in seq_len(nrow(df))){
matched <- which(df$before==df$now[i] | df$after==df$now[i]) #check matches in before and after columns
group <- min(df$group[i], df$group[matched]) #identify smallest group no of matching rows
df$group[i] <- group #set to smallest group
df$group[matched] <- group #set to smallest group
}
if(identical(df$group, pre)) stop <- TRUE #stop if no change
}
df$group <- LETTERS[match(df$group, sort(unique(df$group)))] #convert groups to letters
#(just use match(...) to keep them as integers - e.g. if you have more than 26 groups)
df <- df[order(df$group, df$now),] #reorder as required
df
now before after group
1 1 0 2 A
2 2 1 3 A
3 3 2 4 A
4 4 3 5 A
8 5 4 6 A
9 6 5 0 A
10 21 0 22 B
11 22 21 23 B
12 23 22 24 B
5 24 23 25 B
6 25 24 26 B
7 26 25 0 B
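If the loop is still too slow on the real data, one loop-free alternative (my sketch, assuming the igraph package and the original, unmodified df) is to treat each row as a link in a chain and label the connected components of the resulting graph:
library(igraph)
edges <- with(df, cbind(now, after))
edges <- edges[edges[, 2] != 0, ]  # drop the 0 'no neighbour' sentinel
g <- graph_from_edgelist(apply(edges, 2, as.character), directed = FALSE)
comp <- components(g)$membership   # component id per vertex, named by 'now' value
df$group <- LETTERS[comp[as.character(df$now)]]
Component ids may come out in a different order on other inputs; if a stable A, B, ... labelling matters, renumber first with match(comp, sort(unique(comp))) as in the loop solution above.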
I am looking to separate rows of data by Cue and add a row which calculates averages per subject. Here is an example:
Before:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379
After:
Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 0.67978 0.51071 0.31723
4 4 22 0.26855 0.17487 0.22461
5 4 20 0.15106 0.48767 0.49072
6 0.209 0.331 0.357
7 7 18 0.11627 0.12604 0.2832
8 7 24 0.50201 0.14252 0.21454
9 0.309 0.134 0.248
10 12 16 0.27649 0.96008 0.42114
11 12 18 0.60852 0.21637 0.18799
12 0.442 0.588 0.304
13 22 20 0.32867 0.65308 0.29388
14 22 24 0.25726 0.37048 0.32379
15 0.292 0.511 0.308
So in the "after" example, line 3 is the average of lines 1 and 2 (line 6 is the average of lines 4 and 5, etc...).
Any help/information would be greatly appreciated!
Thank you!
You can use base R to do something like:
Reduce(rbind, by(data, data[1], function(x) rbind(x, c(NA, NA, colMeans(x[-(1:2)])))))
Cue ITI a b c
1 0 16 0.820620 0.521850 0.276790
2 0 24 0.538940 0.499570 0.357670
3 NA NA 0.679780 0.510710 0.317230
32 4 22 0.268550 0.174870 0.224610
4 4 20 0.151060 0.487670 0.490720
31 NA NA 0.209805 0.331270 0.357665
5 7 18 0.116270 0.126040 0.283200
6 7 24 0.502010 0.142520 0.214540
33 NA NA 0.309140 0.134280 0.248870
7 12 16 0.276490 0.960080 0.421140
8 12 18 0.608520 0.216370 0.187990
34 NA NA 0.442505 0.588225 0.304565
9 22 20 0.328670 0.653080 0.293880
10 22 24 0.257260 0.370480 0.323790
35 NA NA 0.292965 0.511780 0.308835
Here is one idea. Split the data frame, perform the analysis, and then combine them together.
DF_list <- split(DF, f = DF$Cue)
DF_list2 <- lapply(DF_list, function(x){
df_temp <- as.data.frame(t(colMeans(x[, -c(1, 2)])))
df_temp[, c("Cue", "ITI")] <- NA
df <- rbind(x, df_temp)
return(df)
})
DF2 <- do.call(rbind, DF_list2)
rownames(DF2) <- 1:nrow(DF2)
DF2
# Cue ITI a b c
# 1 0 16 0.820620 0.521850 0.276790
# 2 0 24 0.538940 0.499570 0.357670
# 3 NA NA 0.679780 0.510710 0.317230
# 4 4 22 0.268550 0.174870 0.224610
# 5 4 20 0.151060 0.487670 0.490720
# 6 NA NA 0.209805 0.331270 0.357665
# 7 7 18 0.116270 0.126040 0.283200
# 8 7 24 0.502010 0.142520 0.214540
# 9 NA NA 0.309140 0.134280 0.248870
# 10 12 16 0.276490 0.960080 0.421140
# 11 12 18 0.608520 0.216370 0.187990
# 12 NA NA 0.442505 0.588225 0.304565
# 13 22 20 0.328670 0.653080 0.293880
# 14 22 24 0.257260 0.370480 0.323790
# 15 NA NA 0.292965 0.511780 0.308835
DATA
DF <- read.table(text = " Cue ITI a b c
1 0 16 0.82062 0.52185 0.27679
2 0 24 0.53894 0.49957 0.35767
3 4 22 0.26855 0.17487 0.22461
4 4 20 0.15106 0.48767 0.49072
5 7 18 0.11627 0.12604 0.2832
6 7 24 0.50201 0.14252 0.21454
7 12 16 0.27649 0.96008 0.42114
8 12 18 0.60852 0.21637 0.18799
9 22 20 0.32867 0.65308 0.29388
10 22 24 0.25726 0.37048 0.32379", header = TRUE)
A data.table approach, but if someone can offer some improvements I'd be keen to hear them.
library(data.table)
dt <- data.table(df)
dt2 <- dt[, lapply(.SD, mean), by = Cue][,ITI := NA][]
data.table(rbind(dt, dt2))[order(Cue)][is.na(ITI), Cue := NA][]
Cue ITI a b c
1: 0 16 0.820620 0.521850 0.276790
2: 0 24 0.538940 0.499570 0.357670
3: NA NA 0.679780 0.510710 0.317230
4: 4 22 0.268550 0.174870 0.224610
5: 4 20 0.151060 0.487670 0.490720
6: NA NA 0.209805 0.331270 0.357665
If you want to leave the Cue values as-is to confirm group, just drop the [is.na(ITI), Cue := NA] from the last line.
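That is:
data.table(rbind(dt, dt2))[order(Cue)][]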
I would use group_by and summarise from the dplyr package to get a data frame with the average values, then rbind the new data frame with the old one and sort by Cue:
library(dplyr)
df_averages <- df_orig %>%
  group_by(Cue) %>%
  summarise(ITI = NA, a = mean(a), b = mean(b), c = mean(c)) %>%
  ungroup()
df_all <- rbind(df_orig, df_averages) %>% arrange(Cue)
I have several time series variables and I want to create two new dummy variables.
Variable one: if the other variables contain a specific value, then variable one equals 1.
Variable two: if the other variables contain that specific value in consecutive years, then variable two equals 1.
My data looks like
ID score_2011 score_2012 score_2013 score_2014 score_2015
1 12 15 96 96 16
2 12 15 15 15 16
3 12 96 20 15 16
4 12 15 18 15 16
5 12 15 96 15 16
I want to get the new variables like the following
IF score_2011~2015 contain 96 then with_96 = 1
IF score_2011~2015 contain back-to-back 96s then back_to_back_96 = 1
I want the result to look like..
ID score_2011 score_2012 score_2013 score_2014 score_2015 with_96 back_to_back_96
1 12 15 96 96 16 1 1
2 12 15 15 15 16 0 0
3 12 96 20 15 16 1 0
4 12 15 18 15 16 0 0
5 12 15 96 15 16 1 0
Thanks in advance
One option would be to loop through the rows: find whether any values are 96 ('x1'), do run-length encoding on each row, check whether any of the run lengths for the TRUE values are greater than 1 ('x2'), then concatenate both, transpose, and assign two new columns to the output.
df1[c("with_96", "back_to_back_96")] <- t(apply(df1[-1], 1, FUN= function(x) {
x1 <- as.integer(any(x==96))
rl <- rle(x==96)
x2 <- any(rl$lengths[rl$values]>1)
c(x1, x2)}))
df1
# ID score_2011 score_2012 score_2013 score_2014 score_2015 with_96 back_to_back_96
#1 1 12 15 96 96 16 1 1
#2 2 12 15 15 15 16 0 0
#3 3 12 96 20 15 16 1 0
#4 4 12 15 18 15 16 0 0
#5 5 12 15 96 15 16 1 0
Or another option is using rowSums:
df1["with_96"] <- +(!!rowSums(df1[-1] == 96))          # 1 if any score is 96
df1["back_to_back_96"] <- rowSums(((df1[-c(1, ncol(df1))] == 96) +
    (df1[-c(1, 2)] == 96)) > 1)                        # adjacent year pairs that are both 96
You can do some fanciness with data.table if you are so inclined. Working on a long-format (melted) dataset might make the logic of some of these comparisons a bit simpler.
library(data.table)
setDT(dat)
melt(dat, id="ID")[, .(with96=any(value==96), b2b96=any(diff(which(value==96))==1)), by=ID]
# ID with96 b2b96
#1: 1 TRUE TRUE
#2: 2 FALSE FALSE
#3: 3 TRUE FALSE
#4: 4 FALSE FALSE
#5: 5 TRUE FALSE
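If you then want those flags as 0/1 columns back on the wide data, a possible follow-up (the coercion with + and the join-back are my additions, not part of the original answer):
res <- melt(dat, id = "ID")[, .(with_96 = +any(value == 96),
                                back_to_back_96 = +any(diff(which(value == 96)) == 1)),
                            by = ID]
dat[res, on = "ID"]  # left-join the flags onto the wide table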
I'm trying to work out whether the next x (6 is the current plan but this could be subject to change) balances remain the same or decrease each month.
I did this in Excel such that it would start with the current month's value and compare next month's against it to see if it stayed the same or decreased and so on.
=IF(AND(H3<=H2,H4<=H3,H5<=H4,H6<=H5,H7<=H6,H8<=H7),1,0)
This isn't the most flexible or elegant formula as it was part of an initial exploration. To make everything cleaner and more reproducible, I'd like to put my calculations into R instead.
Here is a basic dataset that is like my data for multiple IDs over many months.
DF <- rbind(data.frame(ID=1, Month=1:11, Bal=seq(from=500, to=300, by=-20)),
            data.frame(ID=2, Month=1:10, Bal=rep(200,10)),
            data.frame(ID=3, Month=1:11, Bal=seq(from=300, to=500, by=20)))
Something that calculates against the raw data at the row level, or that works inside a ddply call, would be an ideal solution.
I'm still pretty new to R and I'm sure there's an elegant solution for this, but I really can't see it. Does anyone have a neat solution, or could you point me toward the sorts of key terms I should be researching to reach one?
I am not sure if I understood correctly:
checkfun <- function(x, n) {
  # note: this is stats::filter (a moving sum), not dplyr::filter
  rev(filter(rev(c(diff(x) <= 0, NA)), rep(1, n), sides = 1)) == n
}
This function calculates the differences between consecutive values and checks if they are <= 0. The filter sums the number of following n differences that fulfill the condition. This number is finally compared with n, to see if all of them fulfill the condition. (rev is only used, so that a one-sided filter can be used.)
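For intuition, a quick illustrative call on a short vector (my example, checking whether the next 2 changes are all <= 0):
checkfun(c(5, 4, 4, 6, 3), 2)
# [1]  TRUE FALSE FALSE    NA    NA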
DF$Bal[6] <- 505 # introduce an increase so that not all differences qualify
library(plyr)
#example with 3 next values
ddply(DF,.(ID),transform,check=checkfun(Bal,3))
# ID Month Bal check
# 1 1 1 500 TRUE
# 2 1 2 480 TRUE
# 3 1 3 460 FALSE
# 4 1 4 440 FALSE
# 5 1 5 420 FALSE
# 6 1 6 505 TRUE
# 7 1 7 380 TRUE
# 8 1 8 360 TRUE
# 9 1 9 340 NA
# 10 1 10 320 NA
# 11 1 11 300 NA
# 12 2 1 200 TRUE
# 13 2 2 200 TRUE
# 14 2 3 200 TRUE
# 15 2 4 200 TRUE
# 16 2 5 200 TRUE
# 17 2 6 200 TRUE
# 18 2 7 200 TRUE
# 19 2 8 200 NA
# 20 2 9 200 NA
# 21 2 10 200 NA
# 22 3 1 300 FALSE
# 23 3 2 320 FALSE
# 24 3 3 340 FALSE
# 25 3 4 360 FALSE
# 26 3 5 380 FALSE
# 27 3 6 400 FALSE
# 28 3 7 420 FALSE
# 29 3 8 440 FALSE
# 30 3 9 460 NA
# 31 3 10 480 NA
# 32 3 11 500 NA
If df is your data.frame, you can find consecutive differences using:
df$diff <- do.call("c",lapply(unique(df$ID), function(x) c(0,diff(df$Bal[df$ID==x]))))
This assumes that you want to keep those calculations separate for different ID's.
> head(df)
ID Month Bal diff
1 1 1 500 0
2 1 2 480 -20
3 1 3 460 -20
4 1 4 440 -20
5 1 5 420 -20
6 1 6 400 -20
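As an aside, base R's ave() computes the same per-ID differences in one line (my sketch):
df$diff <- ave(df$Bal, df$ID, FUN = function(b) c(0, diff(b)))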
Now, for a given k = 6 (say), check:
k <- 6
sapply(unique(df$ID), function(x) ifelse(sum(df$diff[df$ID==x][1:k] < 0) != 0, 1, 0))
[1] 1 0 0
It returns, for each ID, 1 if any of the first k differences is negative and 0 otherwise.
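For a per-row flag closer to the original Excel formula (TRUE only when all of the next k changes are <= 0), a possible sketch with zoo::rollapply (my addition; edge rows whose look-ahead window runs past the data come back NA):
library(zoo)
library(plyr)
k <- 6  # look-ahead window, as in the Excel formula
DF2 <- ddply(DF, .(ID), transform,
             check = rollapply(c(Bal[-1] <= Bal[-length(Bal)], NA),  # month-over-month "not increasing" flags
                               width = k, FUN = all,
                               align = "left", fill = NA))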