I need to implement a counter, dec_cnt, that decrements by 1 based on certain conditions.
Below is my dataframe df.
ID A
1 0
2 0
3 0
4 1
5 1
6 0
7 0
8 0
9 0
10 0
11 0
12 0
13 0
14 0
15 0
16 -1
17 1
18 0
19 1
20 0
21 -1
22 0
23 0
24 -1
25 0
26 0
27 0
28 0
29 0
30 0
31 0
32 0
33 0
34 0
The conditions are
a. The counter should start at the data point where A == 1 or A == -1 and decrement over the next 16 values. For example, A == 1 at ID 4, so from ID == 4 to ID == 19 the counter should run from 15 down to 0. Any A == 1 or -1 that falls inside this range should be ignored.
b. I also need a retain_A column that retains the triggering value of A throughout the countdown.
Below is my expected output.
ID A retain_A dec_cnt
1 0 NA NA
2 0 NA NA
3 0 NA NA
4 1 1 15
5 1 1 14
6 0 1 13
7 0 1 12
8 0 1 11
9 0 1 10
10 0 1 9
11 0 1 8
12 0 1 7
13 0 1 6
14 0 1 5
15 0 1 4
16 -1 1 3
17 1 1 2
18 0 1 1
19 1 1 0
20 0 NA NA
21 -1 -1 15
22 0 -1 14
23 0 -1 13
24 -1 -1 12
25 0 -1 11
26 0 -1 10
27 0 -1 9
28 0 -1 8
29 0 -1 7
30 0 -1 6
31 0 -1 5
32 0 -1 4
33 0 -1 3
34 0 -1 2
A similar question was posted a couple of days ago, where the solution uses a for loop. That loop fails to execute if there are more than 35 data points. I also want to avoid a for loop because its execution time grows when dealing with a large amount of data.
The data frame is taken from the question posted here.
Below is the script that I tried, based on the post referenced above.
dec_cnt <- 0
Retain_A <- NA
for (i in seq_along(df$A)) {
  if (dec_cnt == 0) {
    if (df$A[i] == 0) next
    dec_cnt <- 15
    Retain_A <- df$A[i]
    df$Retain_A[i] <- df$A[i]
    df$dec_cnt[i] <- dec_cnt
  } else {
    dec_cnt <- dec_cnt - 1
    df$Retain_A[i] <- Retain_A
    df$dec_cnt[i] <- dec_cnt
  }
}
I don't think it's realistic to avoid any kind of loop, for or otherwise. Perhaps a more realistic goal would be to avoid loops that iterate over every single value, regardless of whether it is relevant.
Starting from your 2-column input, let's pre-set the empty columns:
dat$retain_A <- NA
dat$dec_cnt <- NA
Here's where we can gain some efficiency: instead of repeatedly making comparisons, we can know if it matches -1/1 now:
ind <- which(dat$A %in% c(-1,1))
last_match <- 0
ind
# [1] 4 5 16 17 19 21 24
The trick is to keep track of the last_match and discard any indices between it and the next 15 entries.
ind <- ind[ind > last_match]
while (length(ind) > 0) {
  i <- seq(ind[1], min(ind[1] + 15, nrow(dat)))
  dat$dec_cnt[i] <- head(15:0, n = length(i))
  dat$retain_A[i] <- dat$A[ ind[1] ]
  last_match <- ind[1] + 15
  ind <- ind[ind > last_match]
}
dat
# ID A retain_A dec_cnt
# 1 1 0 NA NA
# 2 2 0 NA NA
# 3 3 0 NA NA
# 4 4 1 1 15
# 5 5 1 1 14
# 6 6 0 1 13
# 7 7 0 1 12
# 8 8 0 1 11
# 9 9 0 1 10
# 10 10 0 1 9
# 11 11 0 1 8
# 12 12 0 1 7
# 13 13 0 1 6
# 14 14 0 1 5
# 15 15 0 1 4
# 16 16 -1 1 3
# 17 17 1 1 2
# 18 18 0 1 1
# 19 19 1 1 0
# 20 20 0 NA NA
# 21 21 -1 -1 15
# 22 22 0 -1 14
# 23 23 0 -1 13
# 24 24 -1 -1 12
# 25 25 0 -1 11
# 26 26 0 -1 10
# 27 27 0 -1 9
# 28 28 0 -1 8
# 29 29 0 -1 7
# 30 30 0 -1 6
# 31 31 0 -1 5
# 32 32 0 -1 4
# 33 33 0 -1 3
# 34 34 0 -1 2
You'll find that your initial loop iterates once per row, whereas this solution iterates at most once per non-zero entry and skips the non-zero entries that fall inside an active countdown.
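If the same countdown is needed on more than one data frame, the while loop above could be wrapped in a function. This is only a sketch: the name add_countdown and its arguments col and start are made up for illustration and are not part of the answer above; the sample data is the 34-row A column from the question.

```r
# Hypothetical helper wrapping the index-based loop; names are illustrative.
add_countdown <- function(dat, col = "A", start = 15) {
  dat$retain_A <- NA
  dat$dec_cnt <- NA
  # positions of all triggers, computed once
  ind <- which(dat[[col]] %in% c(-1, 1))
  while (length(ind) > 0) {
    first <- ind[1]
    # rows covered by this countdown, truncated at the end of the data
    i <- seq(first, min(first + start, nrow(dat)))
    dat$dec_cnt[i] <- head(start:0, n = length(i))
    dat$retain_A[i] <- dat[[col]][first]
    # drop every trigger that falls inside the window just filled
    ind <- ind[ind > first + start]
  }
  dat
}

dat <- data.frame(
  ID = 1:34,
  A  = c(0, 0, 0, 1, 1, rep(0, 10), -1, 1, 0, 1, 0, -1, 0, 0, -1, rep(0, 10))
)
out <- add_countdown(dat)
```

The last window starts at ID 21 and is cut off by the end of the data, so its counter stops at 2 rather than 0.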
I have a dataframe with repeated IDs, and I want to get a column that marks the very first occurrence of each ID.
For instance:
ID <- c("ae61_10", "ae61_10", "ae61_10", "ae61_10", "ae61_10", "ae61_1", "ae61_1", "ae61_1", "ae61_11", "ae61_2", "ae61_2", "ae61_3", "ae61_4", "ae61_4", "ae61_5", "ae61_6", "ae61_7", "ae61_8", "ae61_8", "ae61_8", "ae61_9")
df <- data.frame(ID)
df
ID
1 ae61_10
2 ae61_10
3 ae61_10
4 ae61_10
5 ae61_10
6 ae61_1
7 ae61_1
8 ae61_1
9 ae61_11
10 ae61_2
11 ae61_2
12 ae61_3
13 ae61_4
14 ae61_4
15 ae61_5
16 ae61_6
17 ae61_7
18 ae61_8
19 ae61_8
20 ae61_8
21 ae61_9
I want to get the column FIRST_OC, which is 1 if the row is the first occurrence of that ID and 0 if it is not. Like this:
ID FIRST_OC
1 ae61_10 1
2 ae61_10 0
3 ae61_10 0
4 ae61_10 0
5 ae61_10 0
6 ae61_1 1
7 ae61_1 0
8 ae61_1 0
9 ae61_11 1
10 ae61_2 1
11 ae61_2 0
12 ae61_3 1
13 ae61_4 1
14 ae61_4 0
15 ae61_5 1
16 ae61_6 1
17 ae61_7 1
18 ae61_8 1
19 ae61_8 0
20 ae61_8 0
21 ae61_9 1
I have tried this command, but it did not work:
df$FIRST_OC <- 0
FIRST_OC <- df[match(unique(df$ID), df$ID),]
df$FIRST_OC[which(df$ID %in% FIRST_OC)] <- 1
Could somebody help me to figure out what is going wrong?
Many thanks.
duplicated is the function to use!
df$FIRST_OC <- ifelse(!duplicated(df$ID), 1, 0)
# OR
df$FIRST_OC <- as.numeric(!duplicated(df$ID))
We can use duplicated
+(!duplicated(df$ID))
#[1] 1 0 0 0 0 1 0 0 1 1 0 1 1 0 1 1 1 1 0 0 1
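As for what goes wrong in the attempt from the question: as far as I can tell, subsetting a one-column data frame by rows returns a plain vector of the unique IDs, and every row's ID is of course %in% that vector, so every row gets a 1. A minimal sketch of a repair that keeps the match() idea is to assign by row position instead of by value (first_rows is a name introduced here for illustration):

```r
ID <- c(rep("ae61_10", 5), rep("ae61_1", 3), "ae61_11", rep("ae61_2", 2),
        "ae61_3", rep("ae61_4", 2), "ae61_5", "ae61_6", "ae61_7",
        rep("ae61_8", 3), "ae61_9")
df <- data.frame(ID)

# Matching on values flags every row, because each ID is one of the unique IDs
all_rows_match <- all(df$ID %in% unique(df$ID))

# match() returns the row position of the first occurrence of each unique ID
first_rows <- match(unique(df$ID), df$ID)
df$FIRST_OC <- 0
df$FIRST_OC[first_rows] <- 1
```

This gives the same column as the duplicated() one-liners, which remain the shorter option.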
I need to implement some logic in my R script for the sample data frame df shown below.
ID A B
1 2.471264262 0
2 2.53024575 0
3 2.559114933 1
4 2.502350493 1
5 2.529496526 0
6 2.480199137 0
7 2.521066835 0
8 2.481272625 0
9 2.505953959 0
10 2.481272625 0
11 2.499424723 0
12 2.492515087 0
13 2.502385996 0
14 2.487579633 0
15 2.479438021 -1
16 2.044195946 1
17 2.054051421 0
18 2.108811073 1
19 2.249767599 0
20 2.627294516 -1
21 2.624337386 0
22 2.157110862 0
23 2.142325212 -1
24 2.124582433 -1
25 2.114725333 0
26 2.113739623 0
27 1.92054047 0
28 2.00037188 0
29 2.183995509 0
30 2.629451192 0
31 2.772756046 0
32 2.603141474 0
33 2.502385996 0
Column B shows the data points where the state changes. I need to add or subtract a "Correction Factor" to the values in column A for the next 15 data points, starting from the point where B == 1 or B == -1.
The formula for the correction factor is as follows.
If B == 1, then the value with the correction factor applied is [A - 0.19*(15/15)*A]. The fraction (15/15) keeps decrementing over the next 15 values: (14/15), (13/15), ..., (0/15).
Similarly, if B == -1, then the value with the correction factor applied is [A + 0.53*(15/15)*A], and the fraction (15/15) again keeps decrementing over the next 15 values: (14/15), (13/15), ..., (0/15).
Another condition to consider: once a state change has been detected in B, any further state change within the next 15 values should be ignored. For example, the first state change is detected at B3, so the state changes at B4, B15 and B16 should not be considered.
For better understanding, I have attached my expected output along with the formulas as executed manually in Excel.
Expected Output
A B A With Correction Factor Formula Executed
2.471264262 0 2.471264262 Same Value of A retained since no transition
2.53024575 0 2.53024575 Same Value of A retained since no transition
2.559114933 1 2.072883096 A4-0.19* (15/15)*A4
2.502350493 1 2.058600339 A5-0.19* (14/15)*A5
2.529496526 0 2.112972765 A6-0.19* (13/15)*A6
2.480199137 0 2.103208868 A7-0.19* (12/15)*A7
2.521066835 0 2.169798189 A8-0.19* (11/15)*A8
2.481272625 0 2.166978093 A9-0.19* (10/15)*A9
2.505953959 0 2.220275208 A10-0.19* (9/15)*A10
2.481272625 0 2.229836999 A11-0.19* (8/15)*A11
2.499424723 0 2.277809064 A12-0.19* (7/15)*A12
2.492515087 0 2.30308394 A13-0.19* (6/15)*A13
2.502385996 0 2.34390155 A14-0.19* (5/15)*A14
2.487579633 0 2.361542265 A15-0.19* (4/15)*A15
2.479438021 -1 2.385219376 A16-0.19* (3/15)*A16
2.044195946 1 1.992409649 A17-0.19* (2/15)*A17
2.054051421 0 2.028033436 A18-0.19* (1/15)*A18
2.108811073 1 2.108811073 A19-0.19* (0/15)*A19
2.249767599 0 2.249767599 Same Value of A retained since no transition
2.627294516 -1 4.019760609 A21+0.53*(15/15)*A21
2.624337386 0 3.922509613 A22+0.53*(14/15)*A22
2.157110862 0 3.147943785 A23+0.53*(13/15)*A23
2.142325212 -1 3.050671102 A24+0.53*(12/15)*A24
2.124582433 -1 2.950336805 A25+0.53*(11/15)*A25
2.114725333 0 2.861928284 A26+0.53*(10/15)*A26
2.113739623 0 2.785908823 A27+0.53*(9/15)*A27
1.92054047 0 2.463413243 A28+0.53*(8/15)*A28
2.00037188 0 2.495130525 A29+0.53*(7/15)*A29
2.183995509 0 2.647002557 A30+0.53*(6/15)*A30
2.629451192 0 3.093987569 A31+0.53*(5/15)*A31
2.772756046 0 3.164638901 A32+0.53*(4/15)*A32
2.603141474 0 2.87907447 A33+0.53*(3/15)*A33
2.502385996 0 2.679221273 A34+0.53*(2/15)*A34
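The two formulas can be checked against individual rows of the table above with a small helper. This is a sketch only: the function name correct_value and its arguments (a for the value of A, state for the retained B, k for the current numerator of the fraction) are introduced here for illustration.

```r
# Hypothetical helper: applies the correction to one value (or a vector) of A.
# state ==  1  ->  A - 0.19 * (k / n) * A
# state == -1  ->  A + 0.53 * (k / n) * A
correct_value <- function(a, state, k, n = 15) {
  rate <- ifelse(state == 1, -0.19, 0.53)
  a + rate * (k / n) * a
}

first_up   <- correct_value(2.559114933,  1, 15)  # row where the B == 1 window starts
first_down <- correct_value(2.627294516, -1, 15)  # row where the B == -1 window starts
last_up    <- correct_value(2.108811073,  1, 0)   # k = 0 leaves A unchanged
```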
Edit
The code suggested below works exactly as required for the data frame above, i.e. the one with 33 rows, but it doesn't work for the data frame below, which has 32 rows. Any suggestions?
ID A B
1 2.471264262 0
2 2.53024575 0
3 2.559114933 1
4 2.502350493 1
5 2.529496526 0
6 2.480199137 0
7 2.521066835 0
8 2.481272625 0
9 2.505953959 0
10 2.481272625 0
11 2.499424723 0
12 2.492515087 0
13 2.502385996 0
14 2.487579633 0
15 2.479438021 -1
16 2.044195946 1
17 2.054051421 0
18 2.108811073 1
19 2.249767599 0
20 2.627294516 -1
21 2.624337386 0
22 2.157110862 0
23 2.142325212 -1
24 2.124582433 -1
25 2.114725333 0
26 2.113739623 0
27 1.92054047 0
28 2.00037188 0
29 2.183995509 0
30 2.629451192 0
31 2.772756046 0
32 2.603141474 0
I was not able to post another question with this post as the reference, so I have updated the same post.
Thanks.
This should work, the counting to 15 is a little tricky, so we use a for loop to calculate the correct counter and state. The actual formula is then relatively simple:
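# Pre-allocate the new columns. If they are first created inside the loop,
# df$state[i] <- ... builds a vector of length i that has to be recycled
# to nrow(df), which fails when nrow(df) is not a multiple of i (e.g. 32 rows).
df$state <- NA
df$counter <- NA
counter <- 0
current_state <- NA
for (i in seq_along(df$B)) {
  if (counter == 0) {
    if (df$B[i] == 0) next
    counter <- 15
    current_state <- df$B[i]
    df$state[i] <- df$B[i]
    df$counter[i] <- counter
  } else {
    counter <- counter - 1
    df$state[i] <- current_state
    df$counter[i] <- counter
  }
}
df$A_corr <- ifelse(df$state == 1,
                    df$A - 0.19 * (df$counter / 15) * df$A,
                    df$A + 0.53 * (df$counter / 15) * df$A)
# Rows outside any window have state NA; keep their original A
df$A_corr <- ifelse(is.na(df$A_corr), df$A, df$A_corr)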
Gives:
> df
ID A B state counter A_corr
1 1 2.471264 0 NA NA 2.471264
2 2 2.530246 0 NA NA 2.530246
3 3 2.559115 1 1 15 2.072883
4 4 2.502350 1 1 14 2.058600
5 5 2.529497 0 1 13 2.112973
6 6 2.480199 0 1 12 2.103209
7 7 2.521067 0 1 11 2.169798
8 8 2.481273 0 1 10 2.166978
9 9 2.505954 0 1 9 2.220275
10 10 2.481273 0 1 8 2.229837
11 11 2.499425 0 1 7 2.277809
12 12 2.492515 0 1 6 2.303084
13 13 2.502386 0 1 5 2.343902
14 14 2.487580 0 1 4 2.361542
15 15 2.479438 -1 1 3 2.385219
16 16 2.044196 1 1 2 1.992410
17 17 2.054051 0 1 1 2.028033
18 18 2.108811 1 1 0 2.108811
19 19 2.249768 0 NA NA 2.249768
20 20 2.627295 -1 -1 15 4.019761
21 21 2.624337 0 -1 14 3.922510
22 22 2.157111 0 -1 13 3.147944
23 23 2.142325 -1 -1 12 3.050671
24 24 2.124582 -1 -1 11 2.950337
25 25 2.114725 0 -1 10 2.861928
26 26 2.113740 0 -1 9 2.785909
27 27 1.920540 0 -1 8 2.463413
28 28 2.000372 0 -1 7 2.495131
29 29 2.183996 0 -1 6 2.647003
30 30 2.629451 0 -1 5 3.093988
31 31 2.772756 0 -1 4 3.164639
32 32 2.603141 0 -1 3 2.879074
33 33 2.502386 0 -1 2 2.679221
I have a file like this in R.
**0 1**
0 2
**0 3**
0 4
0 5
0 6
0 7
0 8
0 9
0 10
**1 0**
1 11
1 12
1 13
1 14
1 15
1 16
1 17
1 18
1 19
**3 0**
As we can see, there are similar unordered pairs in this ( marked pairs ), like,
1 0
and
0 1
I wish to remove these pairs. I also want to count how many times each pair occurs and append the count after the row that is repeated. If a pair is not repeated, then 1 should be written in the third column.
For example ( A sample of the output file )
0 1 2
0 2 1
0 3 2
0 4 1
0 5 1
0 6 1
0 7 1
0 8 1
0 9 1
0 10 1
1 11 1
1 12 1
1 13 1
1 14 1
1 15 1
1 16 1
1 17 1
1 18 1
1 19 1
How can I achieve it in R?
Here is a way using transform, pmin and pmax to reorder the data by row, and then aggregate to provide a count:
# data
x <- data.frame(a=c(rep(0,10),rep(1,10),3),b=c(1:10,0,11:19,0))
#logic
aggregate(count~a+b,transform(x,a=pmin(a,b), b=pmax(a,b), count=1),sum)
a b count
1 0 1 2
2 0 2 1
3 0 3 2
4 0 4 1
5 0 5 1
6 0 6 1
7 0 7 1
8 0 8 1
9 0 9 1
10 0 10 1
11 1 11 1
12 1 12 1
13 1 13 1
14 1 14 1
15 1 15 1
16 1 16 1
17 1 17 1
18 1 18 1
19 1 19 1
Here's one approach:
First, create a vector of the columns sorted and then pasted together.
x <- apply(mydf, 1, function(x) paste(sort(x), collapse = " "))
Then, use ave to create the counts you are looking for.
mydf$count <- ave(x, x, FUN = length)
Finally, you can use the "x" vector again, this time to detect and remove duplicated values.
mydf[!duplicated(x), ]
# V1 V2 count
# 1 0 1 2
# 2 0 2 1
# 3 0 3 2
# 4 0 4 1
# 5 0 5 1
# 6 0 6 1
# 7 0 7 1
# 8 0 8 1
# 9 0 9 1
# 10 0 10 1
# 12 1 11 1
# 13 1 12 1
# 14 1 13 1
# 15 1 14 1
# 16 1 15 1
# 17 1 16 1
# 18 1 17 1
# 19 1 18 1
# 20 1 19 1
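A vectorised variant of the same idea, as a rough sketch: build the order-independent key with pmin and pmax instead of sorting row by row, then count with ave. The column names V1 and V2 and the helper variables lo, hi and key are assumptions for illustration.

```r
mydf <- data.frame(V1 = c(rep(0, 10), rep(1, 10), 3),
                   V2 = c(1:10, 0, 11:19, 0))

# Order-independent key: smaller value first, larger value second
lo  <- pmin(mydf$V1, mydf$V2)
hi  <- pmax(mydf$V1, mydf$V2)
key <- paste(lo, hi)

# Count how often each unordered pair occurs; summing 1s keeps the count numeric
mydf$count <- ave(rep(1L, nrow(mydf)), key, FUN = sum)

# Keep only the first row of each unordered pair
res <- mydf[!duplicated(key), ]
```

Here res keeps each pair in the orientation in which it first appears, so the pair written as 1 0 in the input is reported once, as 0 1, with a count of 2.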
I have a data.frame with a factor identifying events
year event
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 1
18 0
19 0
20 0
And I would need a counter-type identifying a given window around the events. The result should look like this (for a window that is, for example, 3 periods around the event):
year event window
1 0
2 0
3 0
4 0
5 0
6 0 -3
7 0 -2
8 0 -1
9 1 0
10 0 1
11 0 2
12 0 3
13 0
14 0 -3
15 0 -2
16 0 -1
17 1 0
18 0 1
19 0 2
20 0 3
Any guidance on how to implement this within a function would be appreciated. You can copy the data.frame by pasting the block above in place of "..." here:
dt <- read.table(text = "...", header = TRUE)
Assuming there is no overlapping, you can use one of my favourite base functions, filter:
DF <- read.table(text="year event
1 0
2 0
3 0
4 0
5 0
6 0
7 0
8 0
9 1
10 0
11 0
12 0
13 0
14 0
15 0
16 0
17 1
18 0
19 0
20 0", header=TRUE)
DF$window <- head(filter(c(rep(0, 3), DF$event, rep(0, 3)),
filter=-3:3)[-(1:3)], -3)
DF$window[DF$window == 0 & DF$event==0] <- NA
# year event window
# 1 1 0 NA
# 2 2 0 NA
# 3 3 0 NA
# 4 4 0 NA
# 5 5 0 NA
# 6 6 0 -3
# 7 7 0 -2
# 8 8 0 -1
# 9 9 1 0
# 10 10 0 1
# 11 11 0 2
# 12 12 0 3
# 13 13 0 NA
# 14 14 0 -3
# 15 15 0 -2
# 16 16 0 -1
# 17 17 1 0
# 18 18 0 1
# 19 19 0 2
# 20 20 0 3
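Since the question asks for a function, the filter call above could be wrapped as follows. This is a sketch under the same no-overlap assumption; the name event_window and the argument k (the window half-width) are made up here. stats::filter is spelled out so that a filter function from another attached package cannot mask it.

```r
# Hypothetical wrapper around the convolution approach.
# event: 0/1 vector, k: number of periods on each side of an event.
event_window <- function(event, k = 3) {
  n <- length(event)
  # pad with k zeros on both sides so the centred filter is defined at the edges
  padded <- c(rep(0, k), event, rep(0, k))
  w <- as.numeric(stats::filter(padded, filter = -k:k))[(k + 1):(k + n)]
  # a 0 that is not an event itself lies outside every window
  w[w == 0 & event == 0] <- NA
  w
}

event <- as.integer(1:20 %in% c(9, 17))
win <- event_window(event, k = 3)
```

If two windows overlapped, their contributions would add up, so the result would no longer be a plain counter.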
Working in R.
The data tracks changes in brain activity over time. Column "mark" contains information about when a particular treatment begins and ends. For example, the first condition (mark==1) begins in row 3 and ends in row 6. The second experimental condition (mark==2) starts in row 9 and ends in row 12. Another batch of treatment one is repeated between rows 15 and 18.
ob.id <- c(1:20)
mark <- c(0,0,1,0,0,1,0,0,2,0,0,2,0,0,1,0,0,1,0,0)
condition<-c(0,0,1,1,1,1,0,0,2,2,2,2,0,0,1, 1,1,1,0,0)
start <- data.frame(ob.id,mark)
result<-data.frame(ob.id,mark,condition)
print (start)
> print (start)
ob.id mark
1 1 0
2 2 0
3 3 1
4 4 0
5 5 0
6 6 1
7 7 0
8 8 0
9 9 2
10 10 0
11 11 0
12 12 2
13 13 0
14 14 0
15 15 1
16 16 0
17 17 0
18 18 1
19 19 0
20 20 0
I need to create a column that would have a dummy variable indicating the membership of an observation in corresponding experimental condition, like this:
> print(result)
ob.id mark condition
1 1 0 0
2 2 0 0
3 3 1 1
4 4 0 1
5 5 0 1
6 6 1 1
7 7 0 0
8 8 0 0
9 9 2 2
10 10 0 2
11 11 0 2
12 12 2 2
13 13 0 0
14 14 0 0
15 15 1 1
16 16 0 1
17 17 0 1
18 18 1 1
19 19 0 0
20 20 0 0
Thanks for your help!
This is a fun little problem. The trick I use below is to first calculate the rle of the mark vector, which makes the problem simpler, as the resulting values vector will always have just one 0 that may or may not need to be replaced (depending on the surrounding values).
# example vector with some edge cases
v = c(0,0,1,0,0,0,1,2,0,0,2,0,0,1,0,0,0,0,1,2,0,2)
v.rle = rle(v)
v.rle
#Run Length Encoding
# lengths: int [1:14] 2 1 3 1 1 2 1 2 1 4 ...
# values : num [1:14] 0 1 0 1 2 0 2 0 1 0 ...
vals = rle(v)$values
# find the 0's that need to be replaced and replace by the previous value
idx = which(tail(head(vals,-1),-1) == 0 & (head(vals,-2) == tail(vals,-2)))
vals[idx + 1] <- vals[idx]
# finally go back to the original vector
v.rle$values = vals
inverse.rle(v.rle)
# [1] 0 0 1 1 1 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 2 2
Probably the least cumbersome thing to do is to put the above in a function and then apply that to your data.frame vector (as opposed to manipulating the vector explicitly).
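A minimal sketch of such a function, assuming the same rule as above (a run of 0s is filled only when the runs on both sides of it carry the same value); the name fill_between_marks is invented for illustration:

```r
# Hypothetical wrapper around the rle steps shown above.
fill_between_marks <- function(v) {
  r <- rle(v)
  vals <- r$values
  # interior runs of 0 whose two neighbouring runs have equal values
  idx <- which(tail(head(vals, -1), -1) == 0 &
               head(vals, -2) == tail(vals, -2))
  vals[idx + 1] <- vals[idx]
  r$values <- vals
  inverse.rle(r)
}

mark <- c(0,0,1,0,0,1,0,0,2,0,0,2,0,0,1,0,0,1,0,0)
condition <- fill_between_marks(mark)

# the edge-case vector from above
v <- c(0,0,1,0,0,0,1,2,0,0,2,0,0,1,0,0,0,0,1,2,0,2)
v_filled <- fill_between_marks(v)
```

With a data.frame this would be used as start$condition <- fill_between_marks(start$mark). Note that the rule also fills a gap between the end of one treatment and the start of the next when both have the same mark, so it relies on treatments with equal marks not following each other directly.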
Another approach, based on @SimonO101's observation, involves constructing the right groups from the starting data (run the by part separately, piece by piece, to see how it works):
library(data.table)
dt = data.table(start)
dt[, result := mark[1],
   by = {tmp = rep(0, length(mark));
         tmp[which(mark != 0)[c(FALSE, TRUE)]] = 1;
         cumsum(mark != 0) - tmp}]
dt
# ob.id mark result
# 1: 1 0 0
# 2: 2 0 0
# 3: 3 1 1
# 4: 4 0 1
# 5: 5 0 1
# 6: 6 1 1
# 7: 7 0 0
# 8: 8 0 0
# 9: 9 2 2
#10: 10 0 2
#11: 11 0 2
#12: 12 2 2
#13: 13 0 0
#14: 14 0 0
#15: 15 1 1
#16: 16 0 1
#17: 17 0 1
#18: 18 1 1
#19: 19 0 0
#20: 20 0 0
The latter approach will probably be more flexible.
Here is one way I could think of doing it:
# Find where experiments stop and start
ind <- which( result$mark != 0 )
[1] 3 6 9 12 15 18
# Make a matrix of the start and stop indices taking odd and even elements of the vector
idx <- cbind( head(ind , -1)[ 1:length(ind) %% 2 == 1 ] ,tail( ind , -1)[ 1:length(ind) %% 2 == 1 ] )
[,1] [,2]
[1,] 3 6
[2,] 9 12
[3,] 15 18
edit
I realised making the above index matrix would be easier with just taking odd and even elements:
idx <- cbind( ind[ 1:length(ind) %% 2 == 1 ] , ind[ 1:length(ind) %% 2 != 1 ] )
# Make vector of row indices to turn to 1's
ones <- as.vector( apply( idx , 1 , function(x) c( x[1]:x[2] ) ) )
# Make your new column and turn appropriate rows to 1
result$condition <- 0
result$condition[ ones ] <- 1
result
# ob.id mark condition
#1 1 0 0
#2 2 0 0
#3 3 1 1
#4 4 1 1
#5 5 1 1
#6 6 1 1
#7 7 0 0
#8 8 0 0
#9 9 1 1
#10 10 1 1
#11 11 1 1
#12 12 1 1
#13 13 0 0
#14 14 0 0
#15 15 1 1
#16 16 1 1
#17 17 1 1
#18 18 1 1
#19 19 0 0
#20 20 0 0
Edit
@eddi pointed out that I needed to put in the value of the experiment, not just 1. So this is another strategy, which uses (gasp!) a for loop. This will only be really detrimental if you have thousands of experiments (remember to pre-allocate your results vector):
ind <- matrix( which( start$mark != 0 ) , ncol = 2 , byrow = TRUE )
ind <- cbind( ind , start$mark[ ind[ , 1 ] ] )
# [,1] [,2] [,3]
#[1,] 3 6 1
#[2,] 9 12 2
#[3,] 15 18 1
res <- integer( nrow( start ) )
for( i in 1:nrow(ind) ){
res[ ind[i,1]:ind[i,2] ] <- ind[i,3]
}
[1] 0 0 1 1 1 1 0 0 2 2 2 2 0 0 1 1 1 1 0 0
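If the loop over experiments ever does become a bottleneck, the same start/stop pairs can be expanded in one assignment. This is a sketch, assuming (as above) that the non-zero marks always come in start/stop pairs; from, to and pos are names introduced here.

```r
mark <- c(0,0,1,0,0,1,0,0,2,0,0,2,0,0,1,0,0,1,0,0)

pos  <- which(mark != 0)
from <- pos[c(TRUE, FALSE)]   # odd elements: start rows
to   <- pos[c(FALSE, TRUE)]   # even elements: stop rows

res <- integer(length(mark))
# expand every start:stop range and repeat the experiment value to match
res[unlist(Map(seq, from, to))] <- rep(mark[from], to - from + 1)
```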