I have the following data frame in R:
a b
1 0
2 0
3 0
4 1
5 1
6 1
7 0
8 0
9 0
10 1
11 1
The desired data frame would be:
a b Flag
1 0 1
2 0 2
3 0 3
4 1 4
5 1 4
6 1 4
7 0 5
8 0 6
9 0 7
10 1 8
11 1 8
The sequence should increase on every row where b is 0 and stay the same within a run of 1s.
I am doing it with the following command:
df$flag <- with(a, match(b, unique(b)))
But it does not give me the desired output.
This has been updated to account for the first element of b being 1. Thanks to #tk3 for pointing out that a change was needed.
It looks like your rule is to increase flag if b is zero OR if it is the first 1 in a sequence.
This will give your answer.
cumsum(1 + c(df$b[1],diff(df$b)>0) - df$b)
[1] 1 2 3 4 4 4 5 6 7 8 8
If you just wanted to increase flag when b is zero, you could use
cumsum(1-df$b). Except that would not change the flag for the first one in a series. So I wanted to make an altered version of b that would set b=0 for all of the first ones. You can use c(df$b[1], diff(df$b) >0) to get all of the places that b changed from zero to one - the "first ones". Now
df$b - c(df$b[1],diff(df$b)>0)
0 0 0 0 1 1 0 0 0 0 1
changes all of the "first ones" to zeros, including a 1 in the first element of b. With this altered b we can use cumsum as above. We want to take cumsum of
1 - ( df$b - c(df$b[1],diff(df$b)>0) ) = 1 + c(df$b[1],diff(df$b)>0) - df$b
which is the expression in my answer:
cumsum(1 + c(df$b[1],diff(df$b)>0) - df$b)
[1] 1 2 3 4 4 4 5 6 7 8 8
The original version worked only for df$b[1] = 0. The updated version should also work for df$b[1] = 1.
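A quick way to check both cases is to wrap the expression in a small helper. This is only a sketch: the name flag_runs and the second test vector are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper wrapping the expression from the answer above.
flag_runs <- function(b) {
  # 1 wherever a run of ones starts (including position 1 if b starts with 1)
  first_ones <- c(b[1], diff(b) > 0)
  cumsum(1 + first_ones - b)
}

# b from the question (starts with 0)
b0 <- c(0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1)
flag_runs(b0)
# [1] 1 2 3 4 4 4 5 6 7 8 8

# Made-up vector that starts with 1
b1 <- c(1, 1, 0, 0, 1, 1, 0)
flag_runs(b1)
# [1] 1 1 2 3 4 4 5
```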
The following seems to do what you want.
I find it a bit complicated but it works.
sp <- split(df, cumsum(c(0, abs(diff(df$b)))))
df2 <- lapply(sp, function(DF) {
DF$Flag <- as.integer(DF$b != 1)
if(DF$b[1] == 1) DF$Flag[1] <- 1
DF
})
rm(sp) # clean up
df2 <- do.call(rbind, df2)
df2$Flag <- cumsum(df2$Flag)
row.names(df2) <- NULL
df2
# a b Flag
#1 1 0 1
#2 2 0 2
#3 3 0 3
#4 4 1 4
#5 5 1 4
#6 6 1 4
#7 7 0 5
#8 8 0 6
#9 9 0 7
#10 10 1 8
#11 11 1 8
Related
I have the following dataframe (df):
A B T Required col (window = 3)
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1 4
5 6 0 0 2
6 4 1 1 0
7 7 1 1 1
8 8 1 1 1
9 1 0 0 1
I would like to add the required column, as follows:
Insert in the current row the previous row's value of A or B.
If, in the last 3 (window) rows, the content of column A equals column T most of the time, choose A; otherwise choose B. (There can be more columns, in which case the column that most often equals T is chosen.)
What is the most efficient way to do this for a big data table?
I renamed the column T to TC to avoid confusion with T as an abbreviation for TRUE.
library(tidyverse)
library(data.table)
df[, newcol := {
equal <- A == TC
map(1:.N, ~ if(.x <= 3) NA
else if(sum(equal[.x - 1:3]) > 3/2) A[.x - 1]
else B[.x - 1])
}]
df
# N A B TC newcol
# 1: 1 1 0 1 NA
# 2: 2 3 0 3 NA
# 3: 3 4 0 4 NA
# 4: 4 2 1 1 4
# 5: 5 6 0 0 2
# 6: 6 4 1 1 0
# 7: 7 7 1 1 1
# 8: 8 8 1 1 1
# 9: 9 1 0 0 1
This works too, but it's less clear and likely less efficient:
df[, newcol := shift(A == TC, 1:3) %>%
pmap_lgl(~sum(...) > 3/2) %>%
ifelse(shift(A), shift(B))]
data:
df <- fread("
N A B TC
1 1 0 1
2 3 0 3
3 4 0 4
4 2 1 1
5 6 0 0
6 4 1 1
7 7 1 1
8 8 1 1
9 1 0 0
")
Probably much less efficient than the answer by Ryan, but without additional packages.
A<-c(1,3,4,2,6,4,7,8,1)
B<-c(0,0,0,1,0,1,1,1,0)
TC<-c(1,3,4,1,0,1,1,1,0)
req<-rep(NA,9)
df<-data.frame(A,B,TC,req)
window<-3
for(i in window:(length(req)-1)){
equal <- sum(df$A[(i-window+1):i]==df$TC[(i-window+1):i])
if(equal > window/2){
df$req[i+1]<-df$A[i]
}else{
df$req[i+1]<-df$B[i]
}
}
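The same rule can also be written without an explicit loop. The sketch below is one possible base R variant, not the answerer's method: it uses stats::filter for the rolling count of matches, and the helper prev is a made-up name for "value from the previous row".

```r
A  <- c(1, 3, 4, 2, 6, 4, 7, 8, 1)
B  <- c(0, 0, 0, 1, 0, 1, 1, 1, 0)
TC <- c(1, 3, 4, 1, 0, 1, 1, 1, 0)
window <- 3

equal <- as.numeric(A == TC)
# Matches in the `window` rows ending at each row (NA for the first window - 1 rows)
hits <- as.numeric(stats::filter(equal, rep(1, window), sides = 1))

# Hypothetical helper: shift a vector down by one row
prev <- function(x) c(NA, head(x, -1))

# Decide from the window ending at the previous row, then take that row's value
req <- ifelse(prev(hits) > window / 2, prev(A), prev(B))
req
# [1] NA NA NA  4  2  0  1  1  1
```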
I've got a data frame containing values relating to observations, 1 or 0. I want to count the continual occurrences of 1, resetting at 0. The run length encoding function (rle) seems like it would do the work but I can't work out getting the data into the desired format. I want to try doing this without writing a custom function. In the data below, I have observation in a data frame, then I want to derive the "continual" column and write back to the dataframe. This link was a good start.
observation continual
0 0
0 0
0 0
1 1
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
1 12
0 0
0 0
You can do this pretty easily in a couple of steps:
x <- rle(mydf$observation) ## run rle on the relevant column
new <- sequence(x$lengths) ## create a sequence of the lengths values
new[mydf$observation == 0] <- 0 ## replace relevant values with zero
new
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 0
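To see that the count really restarts after each 0, here is the same approach on a made-up vector with two separate runs of 1s (the data are an assumption for illustration, not from the question):

```r
# Made-up data with two runs of ones
mydf <- data.frame(observation = c(0, 1, 1, 1, 0, 0, 1, 1, 0))

x <- rle(mydf$observation)        # run lengths: 1 3 2 2 1
new <- sequence(x$lengths)        # counts 1..n inside every run
new[mydf$observation == 0] <- 0   # runs of zeros contribute 0
mydf$continual <- new
mydf$continual
# [1] 0 1 2 3 0 0 1 2 0
```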
Using the devel version you could try
library(data.table) ## v >= 1.9.5
setDT(df)[, continual := seq_len(.N) * observation, by = rleid(observation)]
There is probably a better way, but:
g <- c(0,cumsum(abs(diff(df$obs))))
df$continual <- ave(g,g,FUN=seq_along)
df$continual[df$obs==0] <- 0
Simply adapting the accepted answer from the question you linked:
unlist(mapply(function(x, y) seq(x)*y, rle(df$obs)$lengths, rle(df$obs)$values))
# [1] 0 0 0 1 2 3 4 5 6 7 8 9 10 11 12 0 0
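You can use a simple base R one-liner, using the fact that observation contains only 0 and 1, coupled with a vectorized operation. Note that cumsum never resets, so this gives the desired result only when the data contain a single run of 1s, as in the example: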
transform(df, continual=ifelse(observation, cumsum(observation), observation))
# observation continual
#1 0 0
#2 0 0
#3 0 0
#4 1 1
#5 1 2
#6 1 3
#7 1 4
#8 1 5
#9 1 6
#10 1 7
#11 1 8
#12 1 9
#13 1 10
#14 1 11
#15 1 12
#16 0 0
#17 0 0
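The one-liner above matches the desired output because the example contains a single run of 1s; cumsum on its own never resets. A minimal sketch on made-up data with two runs shows the difference. Grouping by cumsum(observation == 0) is my own assumption about one way to restart the count, not part of the answer.

```r
# Made-up data with two runs of ones
df <- data.frame(observation = c(0, 1, 1, 1, 0, 0, 1, 1, 0))

# Plain cumsum keeps counting across the zeros
naive <- ifelse(df$observation, cumsum(df$observation), df$observation)
naive
# [1] 0 1 2 3 0 0 4 5 0

# Start a new group at every 0, then cumsum within each group
reset <- ave(df$observation, cumsum(df$observation == 0), FUN = cumsum)
reset
# [1] 0 1 2 3 0 0 1 2 0
```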
I want to add a column to a data frame based on the values in another column. I want a specific value only the first time a value appears in the other column. For example:
s <- c(6,5,6,7,8,7,6,5)
i <- c(4,5,4,3,2,3,4,5)
t <- c(1,1,3,4,5,6,6,8)
df<- data.frame(t,s,i)
> df
t s i
1 1 6 4
2 1 5 5
3 3 6 4
4 4 7 3
5 5 8 2
6 6 7 3
7 6 6 4
8 8 5 5
Now I want to add a column "mark" that gives a 1 for the first time t=1 and the first time t=6. So that I get: 1 0 0 0 0 1 0 0. I have this code:
for(i in 1:nrow(df)){
if (df$t[i] == 1 & df$t[i-1] != 1 | (df$t[i] == 6 & df$t[i-1] != 6)){
df$mark[i] <- 1
} else {
df$mark[i] <- 0
}
}
This however gives the following error:
Error in if (df$t[i] == 1 & df$t[i - 1] != 1 | (df$t[i] == 6 & df$t[i - :argument is of length zero
Can anyone tell me what is going wrong?
Don't use loops, just do
df$mark <- 0
df$mark[match(c(1, 6), df$t)] <- 1
From the ?match documentation:
match returns a vector of the positions of (first) matches of its
first argument in its second.
The reason you are getting an error in your loop is that you are looping from 1 to nrow(df) while the condition refers to df$t[i-1]. In the first iteration that is df$t[0], which is a zero-length vector, so the condition passed to if has length zero.
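If you do want to keep a loop, the first row needs special handling. The sketch below keeps the logic of the loop in the question (mark a row when its value is 1 or 6 and differs from the previous row); the names k, targets and is_new are made up for illustration.

```r
t <- c(1, 1, 3, 4, 5, 6, 6, 8)
df <- data.frame(t)
df$mark <- 0                      # pre-allocate the column
targets <- c(1, 6)                # values to mark at the start of a run

for (k in seq_len(nrow(df))) {
  # Row 1 has no predecessor; || short-circuits, so df$t[0] is never evaluated
  is_new <- k == 1 || df$t[k] != df$t[k - 1]
  if (df$t[k] %in% targets && is_new) df$mark[k] <- 1
}
df$mark
# [1] 1 0 0 0 0 1 0 0
```

Using || and && instead of | and & matters here: they work on single values and stop evaluating as soon as the result is known.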
within(df, mark<- (c(1,diff(t %in% c(1,6)))==1) +0)
# t s i mark
# 1 1 6 4 1
# 2 1 5 5 0
# 3 3 6 4 0
# 4 4 7 3 0
# 5 5 8 2 0
# 6 6 7 3 1
# 7 6 6 4 0
# 8 8 5 5 0
Or, since 1 and 6 happen to be the only values in t that repeat (and each occurs exactly twice):
duplicated(df$t, fromLast=TRUE) + 0
#[1] 1 0 0 0 0 1 0 0
I would like to compute a running sum of one column's values, restarting for each value of another column, and put the result into a new column without deleting any rows.
Below is some R code and an example that does the trick and hopefully illustrates my question. I was wondering if there is a more elegant way to do it, since the for loop will be time-consuming on my actual object.
Thanks for any feedback.
As an example dataframe:
MyDf <- data.frame(ID = c(1,1,1,2,2,2), Y = 1:6)
MyDf$FIRST <- c(1,0,0,1,0,0)
MyDf.2 <- MyDf
MyDf.2$Y2 <- c(1,3,6,4,9,15)
The goal is to write code that calculates Y2 in MyDf.2 above for each ID separately.
This is what I came up with, and it does the trick. (It calculates a TEST column in MyDf that has to be equal to Y2 in MyDf.2.)
MyDf$TEST <- NA
for(i in 1:length(MyDf$Y)){
MyDf[i,]$TEST <- ifelse(MyDf[i,]$FIRST == 1, MyDf[i,]$Y,MyDf[i,]$Y + MyDf[i-1,]$TEST)
}
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
MyDf.2
ID Y FIRST Y2
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
You need ave and cumsum to get the column you want. transform is just to modify your existing data.frame.
MyDf <- transform(MyDf, TEST=ave(Y, ID, FUN=cumsum))
MyDf
ID Y FIRST TEST
1 1 1 1 1
2 1 2 0 3
3 1 3 0 6
4 2 4 1 4
5 2 5 0 9
6 2 6 0 15
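As a small self-contained check, the sketch below rebuilds the example data and compares the result with the expected Y2 values. The second variant, which derives the groups from the FIRST column, is my own assumption about how FIRST is meant to be used (a 1 marks the start of each group).

```r
MyDf <- data.frame(ID = c(1, 1, 1, 2, 2, 2), Y = 1:6)
MyDf$FIRST <- c(1, 0, 0, 1, 0, 0)

# Running sum of Y within each ID
MyDf$TEST <- ave(MyDf$Y, MyDf$ID, FUN = cumsum)
MyDf$TEST
# [1]  1  3  6  4  9 15

# Same result using FIRST: cumsum(FIRST) numbers the groups 1, 1, 1, 2, 2, 2
MyDf$TEST2 <- ave(MyDf$Y, cumsum(MyDf$FIRST), FUN = cumsum)
MyDf$TEST2
# [1]  1  3  6  4  9 15
```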
I want to subtract the smallest value in each subset of a data frame from each value in that subset. For example:
A <- c(1,3,5,6,4,5,6,7,10)
B <- rep(1:4, length.out=length(A))
df <- data.frame(A, B)
df <- df[order(B),]
Subtracting would give me:
A B
1 0 1
2 3 1
3 9 1
4 0 2
5 2 2
6 0 3
7 1 3
8 0 4
9 1 4
I think the output you show is not correct. In any case, from what you explain, I think this is what you want. This uses ave base function:
within(df, { A <- ave(A, B, FUN=function(x) x-min(x))})
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
Of course there are other alternatives such as plyr and data.table.
Echoing Arun's comment above, I think your expected output might be off. In any event, you should be able to use tapply to calculate the minimum of each subset and then use match to line those minima up with the original values:
subs <- tapply(df$A, df$B, min)
df$A <- df$A - subs[match(df$B, names(subs))]
df
A B
1 0 1
5 3 1
9 9 1
2 0 2
6 2 2
3 0 3
7 1 3
4 0 4
8 1 4
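The two approaches above can be checked against each other on the question's data. This is only a sketch; the names via_ave and via_tapply are made up for the comparison.

```r
A <- c(1, 3, 5, 6, 4, 5, 6, 7, 10)
B <- rep(1:4, length.out = length(A))
df <- data.frame(A, B)
df <- df[order(df$B), ]

# ave: subtract the group minimum within each value of B
via_ave <- ave(df$A, df$B, FUN = function(x) x - min(x))

# tapply + match: compute the minima once, then align them with the rows
subs <- tapply(df$A, df$B, min)
via_tapply <- as.vector(df$A - subs[match(df$B, names(subs))])

via_ave
# [1] 0 3 9 0 2 0 1 0 1
all.equal(via_ave, via_tapply)
# [1] TRUE
```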