I'm looking to subset rows by the value of the next row for one column.
df <- data.frame(t = c(1,2,3,4,5,6,7,8),
b = c(1,2,1,0,1,0,1,2))
So I want to subset df and get the rows where b == 1 and the next row has b == 2. The subset should therefore return 2 rows (where t = 1 and t = 7).
I tried using which and lag from dplyr, as mentioned in other answers, but I couldn't get that to work.
We can get the next value with lead, check that it is equal to 2 while the current value is 1, and use both conditions in filter.
library(dplyr)
df %>%
  filter(b == 1, lead(b) == 2)
# t b
#1 1 1
#2 7 1
Or use subset from base R
subset(df, c(b[-1] == 2, FALSE) & b == 1)
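To make the base R condition easier to follow, here is a minimal sketch that builds the shifted vector explicitly. The names next_b, keep and result are illustrative and not part of the original answer.

```r
df <- data.frame(t = c(1,2,3,4,5,6,7,8),
                 b = c(1,2,1,0,1,0,1,2))

# Value of b in the following row; the last row has no successor, so pad with NA
next_b <- c(df$b[-1], NA)

# Keep rows where the current b is 1 and the next b is 2
# (%in% treats the trailing NA as "no match" instead of propagating NA)
keep <- df$b == 1 & next_b %in% 2

result <- df[keep, ]
result
```

Padding with NA rather than FALSE keeps the shifted vector the same type as b, which may be easier to reuse if other conditions on the next row are needed.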
I have the dataframe below and I would like to move the row with C in column Reg as 1st row.
Reg <- rep(LETTERS[1:3], each = 1)
Res <- c("Urban", "Rural","Urban")
df <- data.frame(Reg, Res)
A simple way is to create a logical vector on 'Reg' with != (not equal to), which is TRUE for all other values and FALSE for the row corresponding to 'C'. When we order, FALSE sorts before TRUE (logicals are ordered as 0 and 1), so the 'C' row comes first, followed by the rest.
df[order(df$Reg != 'C'),]
Or similar options in dplyr are
library(dplyr)
df %>%
  slice(order(Reg != 'C'))
Or with arrange
df %>%
  arrange(Reg != 'C')
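The ordering trick in these answers relies on how order treats logical values. A small sketch, using illustrative names flag and idx that are not in the original answer:

```r
Reg <- c("A", "B", "C")

# TRUE for every value except "C"
flag <- Reg != "C"

# Logicals sort as numbers: FALSE is 0, TRUE is 1
as.integer(flag)

# The FALSE position comes first; the TRUE positions keep their original order
idx <- order(flag)
Reg[idx]
```

Because ties keep their original order here, every row other than 'C' stays in the sequence it had before.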
Here is another option
> df[order(replace(seq(nrow(df)),df$Reg=="C",0)),]
Reg Res
3 C Urban
1 A Urban
2 B Rural
There is a similar question for data.table at Replace sets of rows in a data.table with a single row but I am looking for an equivalent solution in tidyverse. So, I have a tibble like:
DT <- tibble (
id=c(1,1,1,1,2,2,2,2),
location = c("a","b","c","d","a","b","d","e"),
seq = c(1,2,3,4,1,2,3,4))
For every id, I want to look for the sequence b,c,d and if there is such a thing, I want to replace the rows with b and c with a single row, let's say z. The values for the other variables should retain the values of the previous b (in this case id and seq)
So in this case, the new tibble should be
DT.Tobe <- tibble (
id=c(1,1,1,2,2,2,2),
location = c("a","z","d","a","b","d","e"),
seq = c(1,2,4,1,2,3,4))
I was not able to find even a starting point for this...
library(dplyr)
# library(zoo) # rollapply
DT %>%
  group_by(id) %>%
  mutate(
    isseq = zoo::rollapply(location, 3, FUN = function(z) identical(z, c("b", "c", "d")), align = "left", partial = TRUE),
    isseq = isseq | lag(isseq, default = FALSE)
  ) %>%
  group_by(id, isseq) %>%
  summarize(
    across(everything(), ~ {
      if (cur_group()$isseq) {
        if (cur_column() == "location") "z" else first(.)
      } else .
    })
  ) %>%
  ungroup() %>%
  select(-isseq)
# # A tibble: 7 x 3
# id location seq
# <dbl> <chr> <dbl>
# 1 1 a 1
# 2 1 d 4
# 3 1 z 2
# 4 2 a 1
# 5 2 b 2
# 6 2 d 3
# 7 2 e 4
The order is changed because group_by(isseq) tends to keep "like" rows together. It should be easy either to re-order afterwards (assuming "seq" is meaningful) or to add an order variable beforehand and sort by it later.
If a single id can contain more than one such sequence (if so, say so), then run-length encoding will be needed here as well, to differentiate between different b-c-d sequences in the same id.
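As a rough illustration of telling several b-c-d sequences apart, the sketch below numbers each match within one id. It uses cumsum over the match starts rather than rle(), and the vector loc and the names starts, in_match and match_id are hypothetical, not part of the answer above.

```r
# Locations for a single id that contains two b, c, d sequences
loc <- c("a", "b", "c", "d", "x", "b", "c", "d")
n <- length(loc)

# TRUE at each position where a b, c, d run starts
starts <- rep(FALSE, n)
i <- seq_len(max(n - 2, 0))
starts[i] <- loc[i] == "b" & loc[i + 1] == "c" & loc[i + 2] == "d"

# Rows to collapse: the "b" that starts a match and the "c" right after it
in_match <- starts | c(FALSE, head(starts, -1))

# Number the matches 1, 2, ... and use 0 for rows outside any match
match_id <- cumsum(starts) * in_match
match_id
```

Under these assumptions, match_id could be grouped on in place of isseq, with the condition changed to match_id > 0, so that each sequence collapses into its own "z" row.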
One possible option would be to use a loop. Here is a rough sketch that only checks for "b" directly followed by "c" and does not check id.
i <- 1
while (i < nrow(DT)) { # nrow(DT) is re-checked each pass because rows get dropped
  if (DT$location[i] == "b" && DT$location[i + 1] == "c") { # b directly followed by c
    DT$location[i] <- "z" # replace b with z
    DT <- DT[-(i + 1), ] # delete c's row
  }
  i <- i + 1
}
The dplyr cheat sheet has some useful functions that may help with finding the right tools for the if-statement part of this sketch.
What are your thoughts?
I want to do a similar thing as in this thread: Subset multiple columns in R - more elegant code?
I have data that looks like this:
df=data.frame(x=1:4,Col1=c("A","A","C","B"),Col2=c("A","B","B","A"),Col3=c("A","C","C","A"))
criteria="A"
What I want to do is to subset the data where the criteria is met in at least two columns, that is, the string in at least two of the three columns is A. In the case above, the subset would be the first and last row of the data frame df.
You can use rowSums :
df[rowSums(df[-1] == criteria) >= 2, ]
# x Col1 Col2 Col3
#1 1 A A A
#4 4 B A A
If criteria is of length > 1, you cannot use == directly; in that case, use sapply with %in%.
df[rowSums(sapply(df[-1], `%in%`, criteria)) >= 2, ]
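As a small worked example of the %in% variant, here is a sketch with a two-value criteria. The values c("A", "B") and the names hits, n_hits and res are assumptions made for illustration.

```r
df <- data.frame(x = 1:4,
                 Col1 = c("A","A","C","B"),
                 Col2 = c("A","B","B","A"),
                 Col3 = c("A","C","C","A"))
criteria <- c("A", "B")   # hypothetical multi-value criteria

# One logical column per data column: TRUE where the cell is any of the criteria
hits <- sapply(df[-1], `%in%`, criteria)

# Count matches per row and keep rows with at least two
n_hits <- rowSums(hits)
res <- df[n_hits >= 2, ]
res
```

With these values, rows 1, 2 and 4 are kept, since row 3 contains only one cell that matches either "A" or "B".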
In dplyr you can use filter with rowwise :
library(dplyr)
df %>%
  rowwise() %>%
  filter(sum(c_across(starts_with('col')) %in% criteria) >= 2)
We can use subset with apply
subset(df, apply(df[-1] == criteria, 1, sum) >1)
# x Col1 Col2 Col3
#1 1 A A A
#4 4 B A A
I would like to compute an id variable based on the unique combination of two (or more) variables. Consider the simple example below:
# Example dataframe
mydf <- data.frame(var1 = LETTERS[c(1, 2, 1)], var2 = LETTERS[c(2, 1, 3)])
mydf
# var1 var2
# A B
# B A
# A C
Here, rows 1 and 2 should have the same id because AB and BA represent a combination of the same elements. Row 3, however, has a different id since the AC combination appears only once.
# Desired output
cbind(mydf, cid = c(1, 1, 2))
# var1 var2 cid
# A B 1
# B A 1
# A C 2
Any suggestion?
We can sort by row, create a logical vector with duplicated and get the cumsum
cbind(mydf, cid = cumsum(!duplicated(t(apply(mydf, 1, sort)))))
You could benefit from factor type in base R for that:
mydf$cid <- as.numeric(factor(apply(mydf,1,function(x) paste0(sort(x), collapse = ""))))
This disregards the order in which the equivalent rows appear in the data frame. The cumsum approach stops working once, for example, rows 2 and 3 of your data frame are switched.
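To see the difference concretely, here is a sketch with rows 2 and 3 of the example swapped. The data frame mydf2 and the variable names are illustrative, and the cumsum line uses duplicated on a pasted key, which flags the same rows as the sorted-matrix version. The match variant is an extra option not mentioned in the answers.

```r
# Same data as mydf, with rows 2 and 3 switched
mydf2 <- data.frame(var1 = c("A", "A", "B"), var2 = c("B", "C", "A"))

# Order-independent key per row, e.g. "AB" for both A,B and B,A
key <- apply(mydf2, 1, function(x) paste0(sort(x), collapse = ""))

# cumsum only counts new keys seen so far, so row 3 wrongly shares an id with row 2
cid_cumsum <- cumsum(!duplicated(key))   # 1 2 2

# factor assigns one id per distinct key, in sorted key order
cid_factor <- as.numeric(factor(key))    # 1 2 1

# match assigns one id per distinct key, in order of first appearance
cid_match <- match(key, unique(key))     # 1 2 1
```

If the ids should be numbered in the order the combinations first appear rather than alphabetically, the match version may be the better fit.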
I have a data.frame with 2 columns. I want the script to return the value of observations if I provide the value ID. The values in ID are unique.
ID = c("A","B","C","D")
observations = c(3,4,3,2)
d = data.frame(ID, observations)
ID observations
1 A 3
2 B 4
3 C 3
4 D 2
I'd like to access the data frame in a way that it returns me the value of the column observations if I provide the respective ID for the row. (Keep in mind that every ID occurs only in one row).
So for example if I provide the ID = A, it returns 3.
Likewise, if ID == B, it returns 4.
Another option using dplyr
library(dplyr)
ID = c("A","B","C","D")
observations = c(3,4,3,2)
d = data.frame(ID, observations)
d %>%
  filter(ID == "D") %>%
  select(observations)
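Note that the pipeline above returns a one-row data frame rather than a bare value; replacing select with dplyr's pull would return the value itself. For comparison, here is a minimal base R sketch of the same lookup. The helper name get_obs is hypothetical.

```r
d <- data.frame(ID = c("A","B","C","D"),
                observations = c(3,4,3,2))

# Hypothetical helper: return the observations value for one or more IDs
# match() gives the row position of each id, or NA when the id is absent
get_obs <- function(id) d$observations[match(id, d$ID)]

get_obs("A")
get_obs(c("B", "D"))
get_obs("Z")   # an unknown ID yields NA instead of an empty result
```

Since every ID occurs in only one row, match is enough here; it would return only the first hit if IDs were repeated.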