In my dataframe, I have two columns of interest: id and name - my goal is to only keep records of id where id has more than one value in name and where the final value in name is 'B' .
The sample data would look like this:
> test
id name
1 1 A
2 2 A
3 3 A
4 4 A
5 5 A
6 6 A
7 7 A
8 2 B
9 1 B
10 2 A
and the output would look like this:
> output
id name
1 1 A
9 1 B
How would one filter to get these rows in R ? I know that you can filter by the those with multiple variables using the %in% operator but am not sure how to add in the condition that 'B' must be the last record. I am not opposed to using a package like dplyr but a solution in base R would be ideal. Any suggestions?
Here is the sample data:
> dput(test)
structure(list(id = c(1, 2, 3, 4, 5, 6, 7, 2, 1, 2), name = c("A",
"A", "A", "A", "A", "A", "A", "B", "B", "A")), .Names = c("id",
"name"), row.names = c(NA, -10L), class = "data.frame")
Using dplyr,
test %>%
group_by(id) %>%
filter(n_distinct(name) > 1 & last(name) == 'B')
#Source: local data frame [2 x 2]
#Groups: id [1]
# A tibble: 2 x 2
# id name
# <dbl> <chr>
#1 1 A
#2 1 B
In data.table:
library(data.table)
setDT(test)[, .SD[length(unique(name)) >= 2 & name[.N] == "B"],by = .(id)]
# id name
#1: 1 A
#2: 1 B
Related
Is there a way to assign the value of the column being created using an existing value from another column when using case_when() with mutate()?
The actual dataframe I'm dealing with is quite complicated so here is a trivial example of what I want:
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ Modifier * 3,
TRUE ~ Assay)) %>%
select(-Modifier)
Expected new_df:
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
I can successfully assign the NA values to the column I am mutating when no cases match, but haven't found a way to assign a value based on the value of some other column in the data frame if I'm manipulating it. I get this error:
Error: Problem with `mutate()` column `Assay`.
i `Assay = case_when(...)`.
x must be a character vector, not a double vector.
Is there a way to do this?
I found that I was able to do this using paste() after experimenting. As noted by a commenter, paste() works because the underlying issue here is an object type issue. The Assay column is a character vector, but the modification includes an integer. The function paste() implicitly converts to a character. The function paste0() will fix the problem, but using as.character() directly addresses the issue.
library(dplyr)
df = tibble(Assay = c("A", "A", "B", "C", "D", "D"),
My_ID = c(3, 12, 36, 5, 13, 1),
Modifier = c(12, 6, 5, 9, 3, 6))
new_df = df %>%
mutate(Assay = case_when(
My_ID == 5 ~ "C/D",
My_ID == 12 ~ "Rm",
My_ID == 13 | My_ID == 3 ~ as.character(Modifier * 3),
TRUE ~ Assay)) %>%
select(-Modifier)
This is the output:
print(new_df)
# A tibble: 6 x 2
Assay My_ID
<chr> <dbl>
1 36 3
2 Rm 12
3 B 36
4 C/D 5
5 9 13
6 D 1
I have in R the following data frame:
ID = c(rep(1,5),rep(2,3),rep(3,2),rep(4,6));ID
VAR = c("A","A","A","A","B","C","C","D",
"E","E","F","A","B","F","C","F");VAR
CATEGORY = c("ANE","ANE","ANA","ANB","ANE","BOO","BOA","BOO",
"CAT","CAT","DOG","ANE","ANE","DOG","FUT","DOG");CATEGORY
DATA = data.frame(ID,VAR,CATEGORY);DATA
That looks like this table below :
ID
VAR
CATEGORY
1
A
ANE
1
A
ANE
1
A
ANA
1
A
ANB
1
B
ANE
2
C
BOO
2
C
BOA
2
D
BOO
3
E
CAT
3
E
CAT
4
F
DOG
4
A
ANE
4
B
ANE
4
F
DOG
4
C
FUT
4
F
DOG
ideal output given the above data frame in R I want to be like that:
ID
TEXTS
category
1
A
ANE
2
C
BOO
3
E
CAT
4
F
DOG
More specifically: I want for ID say 1 to search the most common value in the column VAR which is A and then to search the most common value in the column CATEGORY related to the most common value A which is the ANE and so forth.
How can I do it in R ?
Imagine that it is sample example.My real data frame contains 850.000 rows and has 14000 unique ID.
Another dplyr strategy using count and slice:
library(dplyr)
DATA %>%
group_by(ID) %>%
count(VAR, CATEGORY) %>%
slice(which.max(n)) %>%
select(-n)
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOA
3 3 E CAT
4 4 F DOG
dplyr
library(dplyr)
DATA %>%
group_by(ID) %>%
filter(VAR == names(sort(table(VAR), decreasing=TRUE))[1]) %>%
group_by(ID, VAR) %>%
summarize(CATEGORY = names(sort(table(CATEGORY), decreasing=TRUE))[1]) %>%
ungroup()
# # A tibble: 4 x 3
# ID VAR CATEGORY
# <dbl> <chr> <chr>
# 1 1 A ANE
# 2 2 C BOA
# 3 3 E CAT
# 4 4 F DOG
Data
DATA <- structure(list(ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 4, 4, 4, 4, 4, 4), VAR = c("A", "A", "A", "A", "B", "C", "C", "D", "E", "E", "F", "A", "B", "F", "C", "F"), CATEGORY = c("ANE", "ANE", "ANA", "ANB", "ANE", "BOO", "BOA", "BOO", "CAT", "CAT", "DOG", "ANE", "ANE", "DOG", "FUT", "DOG")), class = "data.frame", row.names = c(NA, -16L))
We could modify the Mode to return the index and use that in slice after grouping by 'ID'
Modeind <- function(x) {
ux <- unique(x)
which.max(tabulate(match(x, ux)))
}
library(dplyr)
DATA %>%
group_by(ID) %>%
slice(Modeind(VAR)) %>%
ungroup
-output
# A tibble: 4 x 3
ID VAR CATEGORY
<dbl> <chr> <chr>
1 1 A ANE
2 2 C BOO
3 3 E CAT
4 4 F DOG
A base R option with nested subset + ave
subset(
subset(
DATA,
!!ave(ave(ID, ID, VAR, FUN = length), ID, FUN = function(x) x == max(x))
),
!!ave(ave(ID, ID, VAR, CATEGORY, FUN = length), ID, VAR, FUN = function(x) seq_along(x) == which.max(x))
)
gives
ID VAR CATEGORY
1 1 A ANE
6 2 C BOO
9 3 E CAT
11 4 F DOG
Explanation
The inner subset + ave is to filter out the rows with the most common VAR values (grouped by ID)
Based on the trimmed data frame the previous step, the outer subset + ave is to filter out the rows with the most common CATEGORY values ( grouped by ID + VAR)
Lets assume a df like this:
X VAL
1 a
2 b
5 a
9 b
32 a
33 b
40 b
42 a
I want to drop all rows where the X[i+1]-X[i] != 1 where I compare grouped pairs (first I look at rows 1-2 then rows 3-4, etc.) and at the same time there has to be VAL = a in a first row of the pair and VAL = b in the second row of the pair. Resulting df should look like this:
X VAL
1 a
2 b
32 a
33 b
Any help is appreciated. Thanks!
If you are looking at pairs of rows, you can group_by 2 rows at a time, and filter (keep) rows where the difference is 1.
Edit: Answer also checks to make sure the first row within a pair is 'a' and the last row within the pair is 'b'.
library(tidyverse)
df %>%
group_by(cumsum(row_number() %% 2)) %>%
filter(diff(X) == 1 && first(VAL) == 'a' && last(VAL) == 'b') %>%
ungroup() %>%
select(X, VAL)
Output
# A tibble: 4 x 2
X VAL
<dbl> <chr>
1 1 a
2 2 b
3 32 a
4 33 b
Data
df <- structure(list(X = c(1, 2, 5, 9, 32, 33, 40, 42), VAL = c("a",
"b", "a", "b", "a", "b", "b", "a")), class = "data.frame", row.names = c(NA,
-8L))
Just do:
df[c(diff(df$x),0) == 1, ]
I have a longitudinal dataset where each subject is represented more than once. One represents one admission for a patient. Each admission, regardless of the subject also has a unique "key". I need to figure out which admission is the "INDEX" admission, that is, the first admission, so that I know that which rows are the subsequent RE-admission. The variable to use is "Daystoevent"; the lowest number represents the INDEX admission. I want to create a new variable based on the condition that for each subject, the lowest number in the variable "Daystoevent" is the "index" admission and each subsequent gets a number "1" , "2" etc. I want to do this WITHOUT changing into the horizontal format.
The dataset looks like this:
Subject Daystoevent Key
A 5 rtwe
A 8 erer
B 3 tter
B 8 qgfb
A 2 sada
C 4 ccfw
D 7 mjhr
B 4 sdfw
C 1 srtg
C 2 xcvs
D 3 muyg
Would appreciate some help.
This may not be an elegant solution but will do the job:
library(dplyr)
df <- df %>%
group_by(Subject) %>%
arrange(Subject, Daystoevent) %>%
mutate(
Admission = if_else(Daystoevent == min(Daystoevent), 0, 1),
) %>%
ungroup()
for(i in 1:(nrow(df) - 1)) {
if(df$Admission[i] == 1) {
df$Admission[i + 1] <- 2
} else if(df$Admission[i + 1] != 0){
df$Admission[i + 1] <- df$Admission[i] + 1
}
}
df[df == 0] <- "index"
df
# # A tibble: 11 x 4
# Subject Daystoevent Key Admission
# <chr> <dbl> <chr> <chr>
# 1 A 2 sada index
# 2 A 5 rtwe 1
# 3 A 8 erer 2
# 4 B 3 tter index
# 5 B 4 sdfw 1
# 6 B 8 qgfb 2
# 7 C 1 srtg index
# 8 C 2 xcvs 1
# 9 C 4 ccfw 2
# 10 D 3 muyg index
# 11 D 7 mjhr 1
Data:
df <- data_frame(
Subject = c("A", "A", "B", "B", "A", "C", "D", "B", "C", "C", "D"),
Daystoevent = c(5, 8, 3, 8, 2, 4, 7, 4, 1, 2, 3),
Key = c("rtwe", "erer", "tter", "qgfb", "sada", "ccfw", "mjhr", "sdfw", "srtg", "xcvs", "muyg")
)
This question already has answers here:
How to remove rows that have only 1 combination for a given ID
(4 answers)
Selecting & grouping dual-category data from a data frame
(4 answers)
Closed 5 years ago.
I have a df looks like
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
Basically, A is 1, B is 2, C is 3 and so on.
However, as you can see, B has "2" and "15"."15" is the wrong value and it should not be here.
I would like to find out the row which Value won't matches within the same Name.
Ideal output will looks like
B 2
B 15
you can use tidyverse functions like:
df %>%
group_by(Name, Value) %>%
unique()
giving:
Name Value
1 A 1
2 B 2
3 B 15
4 C 3
5 D 4
6 E 5
then, to keep only the Name with multiple Value, append above with:
df %>%
group_by(Name) %>%
filter( n() > 1)
Something like this? This searches for Names that are associated to more than 1 value and outputs one copy of each pair {Name - Value}.
df <- data.frame(Name = c("A", "A","A","B", "B", "C", "D", "E", "E"),
Value = c(1, 1, 1, 2, 15, 3, 4, 5, 5))
res <- do.call(rbind, lapply(unique(df$Name), (function(i){
if (length(unique(df[df$Name == i,]$Value)) > 1 ) {
out <- df[df$Name == i,]
out[!duplicated(out$Value), ]
}
})))
res
Result as expected
Name Value
4 B 2
5 B 15
Filter(function(x)nrow(unique(x))!=1,split(df,df$Name))
$B
Name Value
4 B 2
5 B 15
Or:
Reduce(rbind,by(df,df$Name,function(x) if(nrow(unique(x))>1) x))
Name Value
4 B 2
5 B 15