find the place where the variable in a dataframe changes its value [duplicate] - r

This question already has answers here:
Finding the index of first changes in the elements of a vector
(5 answers)
Closed 4 years ago.
I have a lot of data frames in R which look like that:
A B
1 0
2 0
3 0
4 1
5 1
6 1
So between 3 and 4 the B changes value from 0 to 1. What is the most R way of returning the value of A where B changes value?
In the data B changes the value only once, and A is sorted (from 1 to n).

Here is a possible way. Use diff to get the values where column b changes but be carefull, the first value of b, by definition of change, hasn't changed. (The problem is that diff returns a vector with one less element.)
inx <- c(FALSE, diff(data$b) != 0)
data[inx, ]
# a b
#4 4 1
After seeing the OP's comment to another post, the following code shows that this method can also solve the issue when b starts with any value,not just zero.
data2 <- data.frame(a=c(1,2,3,4,5,6),b=c(1,1,1,0,0,0))
inx <- c(FALSE, diff(data2$b) != 0)
data2[inx, ]
# a b
#4 4 0

As OP mentioned,
In the data B changes the value only once
We can use cumsum with duplicated and which.max
which.max(cumsum(!duplicated(df$B)))
#[1] 4
If the value changes multiple times, this will give the index for last change instead.
If we need to subset the row, then we can do
df[which.max(cumsum(!duplicated(df$B))), ]
# A B
#4 4 1
To break it down further, for better understanding
!duplicated(df$B)
#[1] TRUE FALSE FALSE TRUE FALSE FALSE
cumsum(!duplicated(df$B))
#[1] 1 1 1 2 2 2
which.max(cumsum(!duplicated(df$B)))
#[1] 4

In order to identify a change in a sequence, one may use diff, like in the following code:
my_df <- data.frame(A = 1:6, B = c(0,0,0,1,1,1))
which(diff(my_df$B)==1)+1
[1] 4

Related

R set column to maximum of current entry and specified value in an elegant way [duplicate]

This question already has answers here:
R: Get the min/max of each item of a vector compared to single value
(1 answer)
Replace negative values by zero
(5 answers)
Closed 1 year ago.
NOTE: I technically know how to do this, but I feel like there has to be a "nicer" way to do this. If such questions are not allowed here just delete it, but I would really like to improve my R style, so any suggestions are welcome.
I have a dataframe data <- data.frame(foo=rep(c(-1,2),5))
foo
1 -1
2 2
3 -1
4 2
5 -1
6 2
7 -1
8 2
9 -1
10 2
Now I would like to be able to set the entries of foo to a certain value (for this example, let's say 1) if the current entry is smaller than that value.
So my desired output would be
foo
1 1
2 2
3 1
4 2
5 1
6 2
7 1
8 2
9 1
10 2
I feel like there should be something like data$foo <- max(data$foo,1) that does the job (but ofc, it "maxes" over the whole column).
Is there an elegant way to do this?
data$foo <- ifelse(data$foo < 1,1,data$foo) and data$foo <- lapply(data$foo,function(x) max(1,x)) just feel somewhat "ugly".
max gives you maximum of the whole column but for your case you need pmax(parallel maximum) so it gives you maximum of 1 or each number in the vector.
data$foo <- pmax(data$foo, 1)
data
# foo
#1 1
#2 2
#3 1
#4 2
#5 1
#6 2
#7 1
#8 2
#9 1
#10 2
This works:
data <- data.frame(foo=rep(c(-1,2),5))
val <- 1
data[data$foo < val, ] <- val
Let's break this down. data$foo takes the column and makes it into a vector. data$foo < val checks which elements of this vector are smaller than val, creating a new vector of similar lenghts filled with TRUE and FALSE at the correct positions.
Finally, the entire line data[data$foo < val, ] <- val uses that vector of TRUE and FALSE to select the rows (using the [, ]) of data to which val is now used.

Assigning vector elements a value associated with preceding matching value [duplicate]

This question already has answers here:
Calculating cumulative sum for each row
(6 answers)
Sum of previous rows in a column R
(1 answer)
Closed 3 years ago.
I have a vector of alternating TRUE and FALSE values:
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
I'd like to number each instance of TRUE with a unique sequential number and to assign each FALSE value the number associated with the TRUE value preceding it.
therefore, my desired output using the example dat above (which has 4 TRUE values):
1 1 1 2 2 2 2 3 3 4 4 4 4 4
What I tried:
I've tried the following (which works), but I know there must be a simpler solution!!
whichT <- which(dat==T)
whichF <- which(dat==F)
l1 <- lapply(1:length(whichT),
FUN = function(x)
which(whichF > whichT[x] & whichF < whichT[(x+1)])
)
l1[[length(l1)]] <- which(whichF > whichT[length(whichT)])
replaceFs <- unlist(
lapply(1:length(whichT),
function(x) l1[[x]] <- rep(x,length(l1[[x]]))
)
)
replaceTs <- 1:length(whichT)
dat2 <- dat
dat2[whichT] <- replaceTs
dat2[whichF] <- replaceFs
dat2
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
I need a simpler and quicker solution b/c my real data set is 181k rows long!
Base R solutions preferred, but any solution works
cumsum(dat) will do what you want. When used in mathematical functions TRUE gets converted to 1 and FALSE to 0 so taking the cumulative sum will add 1 every time you see a TRUE and add nothing when there is a FALSE which is what you want.
dat <- c(T,F,F,T,F,F,F,T,F,T,F,F,F,F)
cumsum(dat)
# [1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
Instead of doing the indexing, it can be easily done with cumsum from base R. Here, TRUE/FALSE gets coerced to 1/0 and when we do the cumulative sum, whereever there is 1, it gets increment by 1
cumsum(dat)
#[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4
cumsum() is the most straightforward way, however, you can also do:
Reduce("+", dat, accumulate = TRUE)
[1] 1 1 1 2 2 2 2 3 3 4 4 4 4 4

How can i multiply specific number in vector using R

I have task to multiply numbers in vector, but only those that can be divided by 3 modulo 0. I figured out how to replace certain elements in vector by different numbers, but it works only if i replace with certain number. I wasn't able to find any answer here http://www.r-tutor.com/r-introduction/vector or even on this site. Everyone only extracting values to another vector.
x <- c(1,1,2,2,2,3,3)
x[x%%2==0] = 5
# [1] 1 1 5 5 5 3 3
why this doesn't work ?
x[x%%3==0] = x*3
I expect to get this:
c(1,1,5,5,5,9,9)
The assignment vectors are not the same on the lhs and rhs of the assignment operator.
length(x*3)
#[1] 7
length(x[x%%3 ==0])
#[1] 2
We need to do
x[x%%3==0] <- x[x%%3==0]*3
x
#[1] 1 1 5 5 5 9 9
Instead of repeating the logical vector, an object can be created and then do the substitution
i1 <- x%%3 == 0
x[i1] <- x[i1]*3
In the first assignment, there was only a single element and it was assigned to replace the values returned by the logical condition is met
Another option is
pmax(x, x*(!x%%3)*3)
#[1] 1 1 5 5 5 9 9

Adding group column to data frame [duplicate]

This question already has an answer here:
Compute the minimum of a pair of vectors
(1 answer)
Closed 7 years ago.
Say I have the following data frame:
dx=data.frame(id=letters[1:4], count=1:4)
# id count
# 1 a 1
# 2 b 2
# 3 c 3
# 4 d 4
And I would like to (grammatically) add a column that will get the count whenever count<3, otherwise 3, so I'll get the following:
# id count group
# 1 a 1 1
# 2 b 2 2
# 3 c 3 3
# 4 d 4 3
I thought to use
dx$group=if(dx$count<3){dx$count}else{3}
but it doesn't work on arrays. How can I do it?
In this particular case you can just use pmin (as I stated in the comments above):
df$group <- pmin(df$count, 3)
In general your if/else construction does not work on vectors, but you can use the function ifelse. It takes three arguments: First the condition, then the result if the condition is met and finally the result if the condition is not met. For your example you would write the following:
df$group <- ifelse(df$count < 3, df$count, 3)
Note that in your example the pmin solution is better. Just mentioning the ifelse solution for completeness.

R reset cumsum when it found 0 [duplicate]

This question already exists:
R ffdfdply reset cumsum using data.table
Closed 9 years ago.
I am using the ff package to load an excel file.
i=as.ffdf(data.frame(a=c(1,1,1,1,1,1), b=c(1,4,6,2,5,3), c=c(1,1,1,1,1,1), d=c(1,0,1,1,0,1)))
I am trying to get the cumulative sum on column d and reset it whenever it found 0. I am trying to get the below output.
a b c d Result
1 1 1 1 1
1 4 1 0 0
1 6 1 1 1
1 2 1 1 2
1 5 1 0 0
1 3 1 1 1
I know, I can easily achieved it through ddply but I have large set of data rows i.e. > 5000000 rows.
Thanks
This will work but little bit slower with 24385601 rows. I created unique combination on column a and c and use the Arun solution. Key column (key_a_c) is used to split the data set i.e. to reset cumsum.
Create a unique key on column a and c
i$key_a_c <- ikey(i[c("a", "c")])
Generate cumulative series by spliting on the basis of key_a_c
p1=ffdfdply(i, split=as.character(i$key_a_c), FUN= function(x) {
x$Result <- as.ff(x[, "d"] * sequence(rle(x[, "d"])$lengths))
as.data.frame(x)
}, trace=T)
Please share your views and code if you have some optimized solution.

Resources