lag() and lead() in base-R [duplicate] - r

I'm used to using dplyr's lag() and lead() in my code, but I'm wondering -- is there a base R alternative?
For example, assume the following dataframe:
df<-data.frame(a=c("a","a","a","b","b"),stringsAsFactors=FALSE)
Using dplyr, I could do this to mark the beginning of a new grouping in a:
df %>% mutate(groupstart=a!=lag(a)|is.na(lag(a)))
a groupstart
1 a TRUE
2 a FALSE
3 a FALSE
4 b TRUE
5 b FALSE
Is there a way to do this in base R?

You could do something like this, where NAs are combined with a subset of df$a in lag_a, which is then compared with df$a:
lag_a <- c(rep(NA, 1), head(df$a, length(df$a) - 1))
df$groupstart <- df$a != lag_a | is.na(lag_a)
#### OUTPUT ####
a groupstart
1 a TRUE
2 a FALSE
3 a FALSE
4 b TRUE
5 b FALSE
You can generalize this principle in a function:
lead_lag <- function(v, n) {
  if (n > 0) c(rep(NA, n), head(v, length(v) - n))
  else c(tail(v, length(v) - abs(n)), rep(NA, abs(n)))
}
#### OUTPUT ####
lead_lag(df$a, 2) #[1] NA NA "a" "a" "a"
lead_lag(df$a, -2) #[1] "a" "b" "b" NA NA
lead_lag(df$a, 3) #[1] NA NA NA "a" "a"
lead_lag(df$a, -4) #[1] "b" NA NA NA NA
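For the specific task in the question (flagging the first row of each run), a shift helper is not strictly required. Below is a minimal sketch that compares each element with its predecessor directly. It assumes the column has at least one element and no NA values; the variable n is introduced here only for illustration.

```r
df <- data.frame(a = c("a", "a", "a", "b", "b"), stringsAsFactors = FALSE)
n <- nrow(df)

# Compare every element with the one before it;
# the first row has no predecessor, so it always starts a group.
df$groupstart <- c(TRUE, df$a[-1] != df$a[-n])
df$groupstart
```

If the column can contain NA, the comparison itself returns NA at those positions, so the is.na() handling from the answer above would still be needed.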

Related

R - identify sequences in a vector

Suppose I have a vector ab containing A's and B's. I want to identify sequences and create a vector v with length(ab) that indicates the sequence length at the beginning and end of a given sequence and NA otherwise.
I have however the restriction that another vector x with 0/1 will indicate that a sequence ends.
So for example:
rep("A", 6)
"A" "A" "A" "A" "A" "A"
x <- c(0,0,1,0,0,0)
0 0 1 0 0 0
should give
v <- c(3, NA, 3, 3, NA, 3)
An example could be the following:
ab <- c(rep("A", 5), "B", rep("A", 3))
"A" "A" "A" "A" "A" "B" "A" "A" "A"
x <- c(rep(0,3),1,0,1,rep(0,3))
0 0 0 1 0 1 0 0 0
Here the output should be:
4 NA NA 4 1 1 3 NA 3
(without the restriction it would be)
5 NA NA NA 5 1 3 NA 3
So far, my code without the restriction looks like this:
ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0,3),1,0,1,rep(0,3))
cng <- ab[-1L] != ab[-length(ab)] # is there a change in A and B w.r.t the previous value?
idx <- which(cng) # where do the changes take place?
idx <- c(idx,length(ab)) # include the last value
seq_length <- diff(c(0, idx)) # how long are the sequences?
# create v
v <- rep(NA, length(ab))
v[idx] <- seq_length # sequence end
v[idx-(seq_length-1)] <- seq_length # sequence start
v
Does anyone have an idea how I can implement the restriction? (And since my vector has 2 million observations, I wonder whether there is a more efficient way than my approach.)
I would appreciate any comments! Many thanks in advance!
You may do something like this
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
#creating result of lengths
res <- as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
> res
[1] 4 4 4 4 1 1 1 3 3 3
#creating intermediate NAs
replace(res, with(rle(res), setdiff(seq_along(res), c(length(res) + 1 - cumsum(rev(lengths)),
cumsum(lengths),
which(res == 1)))), NA)
[1] 4 NA NA 4 1 1 1 3 NA 3
As per edited scenario
x <- c(rep(0,3),1,rep(0,2),1,rep(0,3))
ab <- c(rep("A", 5), "B", rep("A", 4))
ab[3] <- 'B'
as.numeric(ave(ab, rev(cumsum(rev(x))), FUN = function(z){with(rle(z), rep(lengths, lengths))}))
[1] 2 2 1 1 1 1 1 3 3 3
ab
[1] "A" "A" "B" "A" "A" "B" "A" "A" "A" "A"
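As an alternative that stays close to the question's own code, the restriction can be folded into the definition of where a sequence ends. The sketch below is not the answer's ave()/rle() method; seq_marks is a hypothetical helper name, and it assumes ab and x have the same length, with at least one element and no NA values.

```r
# Hypothetical helper: write the run length at the start and end of each run,
# where a run also ends wherever x == 1.
seq_marks <- function(ab, x) {
  n <- length(ab)
  # A run ends where the next value differs, where x flags an end,
  # or at the last element.
  is_end <- c(ab[-1L] != ab[-n], TRUE) | x == 1
  idx <- which(is_end)
  seq_length <- diff(c(0L, idx))           # how long are the runs?
  v <- rep(NA_integer_, n)
  v[idx] <- seq_length                     # run end
  v[idx - seq_length + 1L] <- seq_length   # run start
  v
}

ab <- c(rep("A", 5), "B", rep("A", 3))
x <- c(rep(0, 3), 1, 0, 1, rep(0, 3))
seq_marks(ab, x)
```

Everything here is vectorized, so it should scale to long vectors in the same way as the original code without the restriction.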

Extract different col in every row of an R data.frame [duplicate]

I have a vector of colnames that is as long as the number of rows in a data frame:
> x <- data.frame(a=c(1,2,3), b=c(3,2,1), c=c(5,6,4))
> cols <- c("c", "a", "b")
> x
a b c
1 1 3 5
2 2 2 6
3 3 1 4
Now I want to extract from x the column cols[i] for each row i of x, that is 5, 2, 1 in this case. I have tried to create a matrix of TRUE and FALSE depending on the match:
> A <- matrix(rep(colnames(x),nrow(x)), nrow=nrow(x), ncol=ncol(x), byrow=TRUE) == cols
> A
[,1] [,2] [,3]
[1,] FALSE FALSE TRUE
[2,] TRUE FALSE FALSE
[3,] FALSE TRUE FALSE
This looks correct, but when I use it as an index, the values come back in column order instead of row order:
> x[A]
[1] 2 1 5
Does someone know of the proper way to solve this indexing problem?
x <- data.frame(a=c(1,2,3), b=c(3,2,1), c=c(5,6,4))
cols <- c("c", "a", "b")
sapply(seq_along(cols), function(i) x[i, cols[i]])
[1] 5 2 1
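A loop-free variant is to index with a two-column matrix of (row, column) pairs. This is a sketch rather than the answer's method; idx and picked are illustrative names, and it assumes every entry of cols is an existing column name. Note that as.matrix() coerces all columns to a common type, so this suits data frames whose columns share one type.

```r
x <- data.frame(a = c(1, 2, 3), b = c(3, 2, 1), c = c(5, 6, 4))
cols <- c("c", "a", "b")

# One (row, column) pair per row of x; match() turns names into positions.
idx <- cbind(seq_len(nrow(x)), match(cols, names(x)))

# Indexing a matrix with a two-column matrix picks one cell per pair.
picked <- as.matrix(x)[idx]
picked
```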

R Subsetting Specific Value Also Returns NA?

I am just starting out on learning R and came across a piece of code as follows
vec_1 <- c("a","b", NA, "c","d")
# create a subset of all elements which equal "a"
vec_1[vec_1 == "a"]
The result from this is
## [1] "a" NA
I'm just curious, since I am subsetting vec_1 for the value "a", why does NA also show up in my results?
This is because the result of anything == NA is NA. Even NA == NA is NA.
Here's the output of vec_1 == "a" -
[1] TRUE FALSE NA FALSE FALSE
and NA is not TRUE or FALSE so when you subset anything by NA you get NA. Check this out -
vec_1[NA]
[1] NA NA NA NA NA
When dealing with NA, R tries to provide the most informative answer i.e. T | NA returns TRUE because it doesn't matter what NA is. Here are some more examples -
T | NA
[1] TRUE
F | NA
[1] NA
T & NA
[1] NA
F & NA
[1] FALSE
Comparing with == never yields TRUE or FALSE for NA (to test for NA itself, use is.na()). In your case you can use the %in% operator -
5 %in% NA
[1] FALSE
"a" %in% NA
[1] FALSE
vec_1[vec_1 %in% "a"]
[1] "a"
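Two other base R ways to drop the NA from the result are sketched below: which() keeps only the positions where the comparison is TRUE, and an explicit is.na() check turns the NA comparison into FALSE. The variable names are illustrative.

```r
vec_1 <- c("a", "b", NA, "c", "d")

# which() returns only the indices where the condition is TRUE,
# so the NA comparison is silently skipped.
only_a_which <- vec_1[which(vec_1 == "a")]

# FALSE & NA is FALSE, so the NA position is excluded here as well.
only_a_mask <- vec_1[!is.na(vec_1) & vec_1 == "a"]

only_a_which
only_a_mask
```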

R data.frame manipulation: convert to NA after specific column

I have a large data.frame and need a row-wise conversion: in each row, every value that comes after a specific character should become NA.
For example I provide little sample from my real data set:
sample_df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
result_df <- data.frame( a = c("V","I","V","V"), b = c("I",NA,"V","V"), c = c(NA,NA,"I","V"), d = c(NA,NA,NA,"V"))
For each row of sample_df, I want to turn every value after the first "I" into NA, which should give result_df.
I tried base R, dplyr, and purrr but could not come up with an algorithm.
Thanks for your help.
Try this:
Find "I" values
I_true<-sample_df=="I"
I_true
a b c d
[1,] FALSE TRUE FALSE FALSE
[2,] TRUE FALSE FALSE FALSE
[3,] FALSE FALSE TRUE TRUE
[4,] FALSE FALSE FALSE FALSE
Find positions from the first "I" seen
out<-t(apply(t(I_true),2,cumsum))
out
a b c d
[1,] 0 1 1 1
[2,] 1 1 1 1
[3,] 0 0 1 2
[4,] 0 0 0 0
Replace needed values
output<-out
output[out>=1]<-NA
output[output==0]<-"V"
output[I_true]<-"I"
output[out>=2]<-NA
Your output
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "V" "V" "I" NA
[4,] "V" "V" "V" "V"
Example 2:
sample_df <- data.frame( a = c("V","I","I","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"))
sample_df
a b c d
1 V I V V
2 I V V V
3 I V I I
4 V V V V
output
a b c d
[1,] "V" "I" NA NA
[2,] "I" NA NA NA
[3,] "I" NA NA NA
[4,] "V" "V" "V" "V"
Here is a brute force approach, which should be the easiest to come up with but the least preferred. Anyway, here it is:
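df <- data.frame( a = c("V","I","V","V"), b = c("I","V","V","V"), c = c("V","V","I","V"), d = c("V","V","I","V"), stringsAsFactors=FALSE)
rowlength <- ncol(df)
for (i in seq_len(nrow(df))) {
  hits <- which(as.character(df[i, ]) == 'I')
  # Blank out everything after the first 'I'; skip rows without an 'I'
  # and rows where the first 'I' is already in the last column
  if (length(hits) > 0 && hits[1] < rowlength) {
    first <- hits[1] + 1
    df[i, first:rowlength] <- NA
  }
}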
Here's a possible answer using ddply from the plyr package
library(plyr)
ddply(sample_df, .(a, b, c, d), function(x) {
  idx <- which(x == 'I')[1] + 1 # ID after first 'I'
  if (!is.na(idx)) { # Check if found
    if (idx <= ncol(x)) { # Prevent out of bounds
      x[, idx:ncol(x)] <- NA
    }
  }
  x
})
The plyr approach :
plyr::adply(sample_df, 1L, function(x) {
if (all(x != "I"))
return(x)
x[1L:min(which(x == "I"))]
})
You have to use an if because which(x == "I") is empty for rows without at least one I, so min() has no position to return.
My Solution:
Following @Julien Navarre's recommendation, I first created a toNA() function:
toNA <- function(x) {
  temp <- grep("INVALID", unlist(x)) # which can be generalized for any string
  lt <- length(x)
  loc <- min(temp, 100) + 1 # 100 is an arbitrary number bigger than the actual column count
  # print(lt) # Debug purposes
  if (loc < lt + 1) {
    x[loc:lt] <- NA
  }
  x
}
First, I tried the plyr::adply() and purrrlyr::by_row() functions to apply my toNA() function to my data.frame, which has over 3 million rows.
Both are very slow (for 1000 rows they take 9 and 6 seconds respectively). These approaches are also slow with a simple function(x) x, so I am not sure where the overhead comes from.
So I tried the base::apply() function (result is my data set):
as.tibble(t(apply(result, 1, toNA ) ))
It takes only 0.2 seconds for 1000 rows.
I am not sure about the programming style, but for now this solution works for me.
Thanks for all your recommendations.
A pure base solution: we build a boolean matrix marking which cells equal "I", then a double cumsum by row shows where the NAs must be placed:
result_df <- sample_df
is.na(result_df) <- t(apply(sample_df == "I",1,function(x) cumsum(cumsum(x)))) >1
result_df
# a b c d
# 1 V I <NA> <NA>
# 2 I <NA> <NA> <NA>
# 3 V V I <NA>
# 4 V V V V
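The same mask can also be built from the column position of the first "I" in each row, which some readers may find easier to follow than the double cumsum. This is a sketch under the assumption that the data frame holds character columns; is_I, first_I and mask are names introduced here for illustration.

```r
sample_df <- data.frame(a = c("V", "I", "V", "V"), b = c("I", "V", "V", "V"),
                        c = c("V", "V", "I", "V"), d = c("V", "V", "I", "V"),
                        stringsAsFactors = FALSE)

is_I <- sample_df == "I"

# Column position of the first "I" in each row (NA when the row has none).
first_I <- apply(is_I, 1, function(r) match(TRUE, r))

# TRUE for cells strictly to the right of the first "I";
# rows without an "I" compare to NA, which is reset to FALSE.
mask <- col(is_I) > first_I
mask[is.na(mask)] <- FALSE

result_df <- sample_df
is.na(result_df) <- mask
result_df
```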

Sub-setting elements of a list in R

Assume that this is my list
a <- list(c(1,2,4))
a[[2]] <- c(2,10,3,2,7)
a[[3]] <- c(2, 2, 14, 5)
How do I subset this list to exclude all the 2's, so that I obtain the following?
[[1]]
[1] 1 4
[[2]]
[1] 10 3 7
[[3]]
[1] 14 5
My current solution:
for(j in seq(1, length(a))){
a[[j]] <- a[[j]][a[[j]] != 2]
}
However, this approach feels a bit unnatural. How would I do the same thing with a function from the apply family?
Thanks!
lapply(a, function(x) x[x != 2])
#[[1]]
#[1] 1 4
#
#[[2]]
#[1] 10 3 7
#
#[[3]]
#[1] 14 5
With lapply you can apply the subset x[x != 2] to each vector in the list.
Alternatively, loop over the list with lapply and use setdiff:
lapply(a, setdiff, 2)
#[[1]]
#[1] 1 4
#[[2]]
#[1] 10 3 7
#[[3]]
#[1] 14 5
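The two approaches agree on the list above, but they are not interchangeable in general: setdiff() treats its arguments as sets and therefore also drops repeated values that are kept. A small sketch with a hypothetical list b that contains duplicates:

```r
# Hypothetical list with repeated values other than 2
b <- list(c(5, 5, 2), c(2, 7, 7, 7))

by_subset  <- lapply(b, function(x) x[x != 2])  # keeps repeated values
by_setdiff <- lapply(b, setdiff, 2)             # returns each value once

by_subset
by_setdiff
```

If duplicates matter, the logical subset is the safer choice.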
