R: Keep rows in a df based on a specific condition

Let's assume a df like this:
X VAL
1 a
2 b
5 a
9 b
32 a
33 b
40 b
42 a
I want to keep only pairs of rows where X[i+1] - X[i] == 1 and drop everything else. The rows are compared in fixed pairs (rows 1-2, then rows 3-4, etc.), and in each kept pair the first row must have VAL = a and the second row VAL = b. The resulting df should look like this:
X VAL
1 a
2 b
32 a
33 b
Any help is appreciated. Thanks!

If you are looking at pairs of rows, you can group_by 2 rows at a time, and filter (keep) rows where the difference is 1.
Edit: Answer also checks to make sure the first row within a pair is 'a' and the last row within the pair is 'b'.
library(tidyverse)
df %>%
group_by(cumsum(row_number() %% 2)) %>%
filter(diff(X) == 1 && first(VAL) == 'a' && last(VAL) == 'b') %>%
ungroup() %>%
select(X, VAL)
Output
# A tibble: 4 x 2
X VAL
<dbl> <chr>
1 1 a
2 2 b
3 32 a
4 33 b
Data
df <- structure(list(X = c(1, 2, 5, 9, 32, 33, 40, 42), VAL = c("a",
"b", "a", "b", "a", "b", "b", "a")), class = "data.frame", row.names = c(NA,
-8L))
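For readers who prefer base R, the same pairwise check can be sketched without dplyr. This is only an illustration, assuming df has an even number of rows (at least two); the names first, second and ok are made up for the example.

```r
df <- data.frame(X = c(1, 2, 5, 9, 32, 33, 40, 42),
                 VAL = c("a", "b", "a", "b", "a", "b", "b", "a"))

# Row indices of the first and second member of each fixed pair (1-2, 3-4, ...)
first  <- seq(1, nrow(df) - 1, by = 2)
second <- first + 1

# A pair is kept when X increases by exactly 1 and VAL goes from "a" to "b"
ok <- df$X[second] - df$X[first] == 1 &
  df$VAL[first] == "a" &
  df$VAL[second] == "b"

# Keep both rows of every pair that passed the check
result <- df[sort(c(first[ok], second[ok])), ]
result
```

Because the pairs are fixed by position, rows 6-7 (33, 40) are never compared with each other, which matches the grouping used in the dplyr answer.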

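Just do (the column is X, not x; this keeps both rows of every adjacent pair whose difference is 1, but it does not check VAL or the fixed two-row pairing):
keep <- diff(df$X) == 1
df[c(keep, FALSE) | c(FALSE, keep), ]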

Related

How can I assign a value with case_when() from dplyr based on another column value?

Is there a way to assign the value of the column being created using an existing value from another column when using case_when() with mutate()?
The actual dataframe I'm dealing with is quite complicated so here is a trivial example of what I want:
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ Modifier * 3,
TRUE ~ Assay)) %>%
select(-Modifier)
Expected new_df:
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
I can successfully assign the NA values to the column I am mutating when no cases match, but haven't found a way to assign a value based on the value of some other column in the data frame if I'm manipulating it. I get this error:
Error: Problem with `mutate()` column `Assay`.
i `Assay = case_when(...)`.
x must be a character vector, not a double vector.
Is there a way to do this?
I found that I was able to do this using paste() after experimenting. As noted by a commenter, paste() works because the underlying issue is one of object type: every right-hand side in case_when() has to return a compatible type. The Assay column is a character vector, but Modifier * 3 is numeric (a double). paste() and paste0() implicitly convert their arguments to character, so either will fix the problem, but as.character() addresses the type mismatch directly.
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ as.character(Modifier * 3),
TRUE ~ Assay)) %>%
select(-Modifier)
This is the output:
print(new_df)
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
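As a quick check of the point about types, the three conversions can be compared in base R on their own. This is a small sketch that does not depend on dplyr; the variable names are made up for the illustration.

```r
modifier <- c(12, 3)

# All three turn the numeric result into a character vector
via_paste        <- paste(modifier * 3)
via_paste0       <- paste0(modifier * 3)
via_as_character <- as.character(modifier * 3)

is.character(via_as_character)
identical(via_paste, via_as_character)
identical(via_paste0, via_as_character)
```

Since the results are identical here, the choice between them is mostly about stating the intent: as.character() says outright that a type conversion is happening.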

How can I extract a subset of data based on another data frame and grab observations before and after that subset

I have two data frames. df_sub is a subset of the main data frame, df. I want to take a subset of df based on df_sub where the resulting data frame is going to be df_sub plus the observations that occur before and after.
As an example, consider the two data sets
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
var2 = c(2, 45))
They look like
> df
var1 var2
1 a 4
2 x 1
3 x 2
4 y 45
5 z 56
6 t 89
> df_sub
var1 var2
1 x 2
2 y 45
The result I want would be
> df_result
2 x 1
3 x 2
4 y 45
5 z 56
I was thinking of using an inner_join or something similar
We could use match to get the indices, add and subtract 1 from them, take the unique values, and subset the rows
v1 <- na.omit(match(do.call(paste, df_sub), do.call(paste, df)) )
df[unique(v1 + rep(c(-1, 0, 1), each = length(v1))),]
Output
var1 var2
2 x 1
3 x 2
4 y 45
5 z 56
Or create a 'flag' column in the 'df_sub', do a left_join, and then filter based on the lead/lag values of 'flag'
library(dplyr)
df %>%
left_join(df_sub %>%
mutate(flag = TRUE)) %>%
filter(flag|lag(flag)|lead(flag)) %>%
select(-flag)
var1 var2
1 x 1
2 x 2
3 y 45
4 z 56
You can create a row number to keep track of the rows that are selected via join. Subset the data by including minimum row number - 1 and maximum row number + 1.
library(dplyr)
tmp <- df %>%
mutate(row = row_number()) %>%
inner_join(df_sub, by = c("var1", "var2"))
df[c(min(tmp$row) - 1, tmp$row, max(tmp$row) + 1), ]
# var1 var2
#2 x 1
#3 x 2
#4 y 45
#5 z 56
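The approaches above add or subtract 1 from row positions, which can step outside the data frame when a match sits in the first or last row. A minimal base R sketch that guards against this follows; with_neighbours is a hypothetical helper name, and it assumes both data frames have the same columns in the same order.

```r
df <- data.frame(var1 = c("a", "x", "x", "y", "z", "t"),
                 var2 = c(4, 1, 2, 45, 56, 89))
df_sub <- data.frame(var1 = c("x", "y"),
                     var2 = c(2, 45))

# Hypothetical helper: rows of `data` that appear in `sub`, plus their neighbours
with_neighbours <- function(data, sub) {
  # Paste the columns together so whole rows can be compared as strings
  hits <- which(do.call(paste, data) %in% do.call(paste, sub))
  idx <- sort(unique(c(hits - 1, hits, hits + 1)))
  # Drop positions that fall before the first or after the last row
  idx <- idx[idx >= 1 & idx <= nrow(data)]
  data[idx, , drop = FALSE]
}

df_result <- with_neighbours(df, df_sub)
df_result

# A match in the first row only has a neighbour after it
edge <- with_neighbours(df, df[1, ])
edge
```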

How to create a rank for a variable in a longitudinal dataset based on a condition?

I have a longitudinal dataset where each subject is represented more than once. Each row represents one admission for a patient. Each admission, regardless of the subject, also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know which rows are the subsequent RE-admissions. The variable to use is "Daystoevent"; the lowest number represents the INDEX admission. I want to create a new variable such that, for each subject, the row with the lowest "Daystoevent" is labelled "index" and each subsequent admission gets a number "1", "2", etc. I want to do this WITHOUT reshaping the data into wide (horizontal) format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
This may not be an elegant solution but will do the job:
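library(dplyr)
# Sort within subject and flag the earliest admission as 0, the rest as 1
df <- df %>%
  arrange(Subject, Daystoevent) %>%
  group_by(Subject) %>%
  mutate(Admission = if_else(Daystoevent == min(Daystoevent), 0, 1)) %>%
  ungroup()
# Walk down the sorted rows: each non-index row is one more than the row above.
# Checking the next row first keeps a new subject's index row from being overwritten.
for (i in seq_len(nrow(df) - 1)) {
  if (df$Admission[i + 1] != 0) {
    df$Admission[i + 1] <- df$Admission[i] + 1
  }
}
# Relabel only the Admission column, so a 0 in another column is left untouched
df$Admission <- if_else(df$Admission == 0, "index", as.character(df$Admission))
df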
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
df <- tibble(
Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)
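A loop-free alternative is to rank Daystoevent within each subject. The following is a base R sketch rather than the answer's method; it assumes ties within a subject should be broken by row order, and it leaves the original row order unchanged.

```r
df <- data.frame(
  Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
  Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
  Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw",
          "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)

# Rank each admission within its subject; the earliest gets 0
rank_in_subject <- ave(df$Daystoevent, df$Subject,
                       FUN = function(v) rank(v, ties.method = "first") - 1)

# Label rank 0 as the index admission and keep the other ranks as text
df$Admission <- ifelse(rank_in_subject == 0, "index",
                       as.character(rank_in_subject))
df
```

Because ave() returns its result in the original row order, no sorting or reshaping is needed before adding the column.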

Subsetting if contains multiple variables in a certain order

In my dataframe, I have two columns of interest: id and name. My goal is to keep only the records of an id where that id has more than one distinct value in name and where its final value in name is 'B'.
The sample data would look like this:
> test
id name
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 2 B
9 1 B
10 2 A
and the output would look like this:
> output
id name
1 1 A
9 1 B
How would one filter to get these rows in R? I know that you can filter for those with multiple values using the %in% operator, but I am not sure how to add the condition that 'B' must be the last record. I am not opposed to using a package like dplyr, but a solution in base R would be ideal. Any suggestions?
Here is the sample data:
> dput(test)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 2, 1, 2), name = c("A",
"A", "A", "A", "A", "A", "A", "B", "B", "A")), .Names = c("id",
"name"), row.names = c(NA, -10L), class = "data.frame")
Using dplyr:
library(dplyr)
test %>%
group_by(id) %>%
filter(n_distinct(name) > 1 & last(name) == 'B')
#Source: local data frame [2 x 2]
#Groups: id [1]
# A tibble: 2 x 2
# id name
# <dbl> <chr>
#1 1 A
#2 1 B
In data.table:
library(data.table)
setDT(test)[, .SD[length(unique(name)) >= 2 & name[.N] == "B"],by = .(id)]
# id name
#1: 1 A
#2: 1 B
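Since the question mentions that a base R solution would be ideal, here is one possible sketch using tapply. The names ok and good_ids are made up for the illustration, and "final value" is taken to mean the last row for that id in the current row order.

```r
test <- data.frame(id = c(1, 2, 3, 4, 5, 6, 7, 2, 1, 2),
                   name = c("A", "A", "A", "A", "A", "A", "A", "B", "B", "A"))

# One TRUE/FALSE per id: more than one distinct name, and the last name is "B"
ok <- tapply(test$name, test$id, function(nm) {
  length(unique(nm)) > 1 && nm[length(nm)] == "B"
})

# tapply names its result by id, so convert the names back to numbers
good_ids <- as.numeric(names(ok)[ok])

output <- test[test$id %in% good_ids, ]
output
```

Here id 2 is dropped because its names run A, B, A, so its final value is 'A' rather than 'B'.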

cumsum when current obs equals next obs for same variable (column)

I want to add a column to a dataframe that holds the cumulative sum of another variable for as long as a third variable stays the same across consecutive rows. For example:
Row Var1 Var2 CumVal
1 A 2 2
2 A 4 6
3 B 5 5
So I want CumVal to accumulate the Var2 column if the Var1 value in row 2 equals the Var1 value in row 1. In other words, if it is equal to the observation before it.
If the cumsum is based on the Var1 as a grouping variable
library(dplyr)
df %>%
group_by(Var1) %>%
mutate(CumVal=cumsum(Var2))
Or
library(data.table)
setDT(df)[, CumVal:=cumsum(Var2), by=Var1]
Or using base R
transform(df, CumVal=ave(Var2, Var1, FUN=cumsum))
Update
If the sum should instead restart whenever adjacent elements of Var1 differ
transform(df, CumVal= ave(Var2, cumsum(c(TRUE,Var1[-1]!=
Var1[-nrow(df)])), FUN=cumsum))
# Row Var1 Var2 CumVal
#1 1 A 2 2
#2 2 A 4 6
#3 3 B 5 5
#4 4 A 6 6
Or the dplyr approach
df %>%
group_by(indx= cumsum(c(TRUE,(lag(Var1)!=Var1)[-1]))) %>%
mutate(CumVal=cumsum(Var2)) %>%
ungroup() %>%
select(-indx)
data
df <- structure(list(Row = 1:4, Var1 = c("A", "A", "B", "A"), Var2 = c(2L,
4L, 5L, 6L)), .Names = c("Row", "Var1", "Var2"), class = "data.frame",
row.names = c(NA, -4L))
I like rle, which detects runs of identical successive values in a vector and describes them in a compact way. E.g. let's say we have a vector x of length 10:
x <- c(2, 3, 2, 2, 2, 2, 0, 0, 2, 1)
rle is able to detect that there are 4 successive 2s and 2 successive 0s:
rle(x)
# Run Length Encoding
# lengths: int [1:6] 1 1 4 2 1 1
# values : num [1:6] 2 3 2 0 2 1
(in the output, we can see that two of the lengths differ from 1: 4 and 2, for the runs of 2s and 0s)
We can use this function to apply cumsum to subvectors of another vector. Let's say we want to apply cumsum to a new vector y <- 1:10, but separately within each run of repeated values of x (the runs will be stored in a factor f):
y <- 1:10
z <- rle(x)$lengths
f <- factor(rep( seq_along(z), z) )
We can then use by or tapply (or something else) to achieve the desired output:
cumval <- unlist(tapply(y, f, cumsum))
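Applied to the data frame from the question, the rle idea might look like the following sketch. It uses ave instead of tapply so the result lines up with the rows; runs and run_id are names chosen for the example.

```r
df <- data.frame(Row = 1:4,
                 Var1 = c("A", "A", "B", "A"),
                 Var2 = c(2L, 4L, 5L, 6L))

# Give every run of identical consecutive Var1 values its own id
runs <- rle(df$Var1)
run_id <- rep(seq_along(runs$lengths), runs$lengths)

# Cumulative sum of Var2 that restarts at the start of each run
df$CumVal <- ave(df$Var2, run_id, FUN = cumsum)
df
```

The last row starts a new run even though 'A' appeared earlier, so its sum restarts at 6 instead of continuing from the first two rows.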
