Consider the following data frame with columns "id" and "x", where each id is repeated four times:
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
The question is about how to subset the data frame by the following criteria:
(1) keep all entries of each id if its values in column x do not contain a 3, or contain a 3 only as the last value.
(2) for a given id with multiple 3s in column x, keep all the numbers up to and including the first 3 and delete the remaining 3s. The expected output would look like:
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
I am familiar with the filter function in the dplyr package for subsetting data, but this situation confuses me because of the complexity of the criteria above. Any help would be greatly appreciated.
Here's one solution that creates some helper columns to filter on:
library(dplyr)
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
df %>%
  group_by(id) %>%                                      # for each id
  mutate(num_threes = sum(x == 3),                      # count number of 3s
         flag = ifelse(unique(num_threes) > 0,          # if there is a 3
                       min(row_number()[x == 3]),       # keep the row of the first 3
                       0)) %>%                          # otherwise put a 0
  filter(num_threes == 0 | row_number() <= flag) %>%    # keep ids with no 3s or up to first 3
  ungroup() %>%
  select(-num_threes, -flag)                            # remove helper columns
# # A tibble: 13 x 2
# id x
# <dbl> <dbl>
# 1 1 2
# 2 1 2
# 3 1 1
# 4 1 1
# 5 2 2
# 6 2 3
# 7 3 1
# 8 3 2
# 9 3 2
# 10 3 3
# 11 4 2
# 12 4 2
# 13 4 3
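For comparison, here is a minimal base R sketch of the same "keep everything up to the first 3" rule (not part of the answer above; the names keep and res are only illustrative). It counts, within each id, how many 3s occur strictly before each row and keeps the row only when that count is zero.

```r
df <- data.frame(id = rep(1:4, each = 4),
                 x  = c(2,2,1,1, 2,3,3,3, 1,2,2,3, 2,2,3,3))

# Within each id, cumsum(is3) - is3 is the number of 3s strictly before the row,
# so a row is kept only if no earlier row of the same id was a 3.
keep <- as.logical(ave(df$x == 3, df$id,
                       FUN = function(is3) cumsum(is3) - is3 == 0))

res <- df[keep, ]
rownames(res) <- NULL   # renumber rows 1..n
res
```

Note that this also drops any non-3 values that come after the first 3 of an id. The sample data has no such rows, so the result is the same 13 rows as the expected output.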
This works for me:
data
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
commands
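library(dplyr)
# Note: lag() is not grouped by id here, so this only drops a 3 that directly
# follows another 3. That is enough for this data but not for every input.
df <- mutate(df, before = lag(x))                # x from the previous row
df$condition1 <- 1
df$condition1[df$x == 3 & df$before == 3] <- 0   # flag a 3 preceded by a 3
final_df <- df[df$condition1 == 1, 1:2]          # keep unflagged rows, columns id and x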
result
id x
1 2
1 2
1 1
1 1
2 2
2 3
3 1
3 2
3 2
3 3
4 2
4 2
4 3
One idea is to pick out the rows with x == 3 and apply unique() to them, which leaves a single 3 per id. Then append those rows to the rest of the data frame and finally restore the original row order.
Here is a solution with base R for the idea above:
res <- (r <- with(df,rbind(df[x!=3,],unique(df[x==3,]))))[order(as.numeric(rownames(r))),]
rownames(res) <- seq(nrow(res))
which gives
> res
id x
1 1 2
2 1 2
3 1 1
4 1 1
5 2 2
6 2 3
7 3 1
8 3 2
9 3 2
10 3 3
11 4 2
12 4 2
13 4 3
DATA
df<-data.frame("id"=c(1,1,1,1,2,2,2,2,3,3,3,3,4,4,4,4),
"x"=c(2,2,1,1,2,3,3,3,1,2,2,3,2,2,3,3))
I have a large dataset of matched pairs (id1 and id2) and would like to create an index variable to enable me to merge these pairs into rows.
As such, the first row would be index 1 and from then on the index will increase by 1, unless either id1 or id2 match any of the values in previous rows. Where this is the case, the previously attributed index should be applied.
I have looked for weeks and most solutions seem to fall short of what I need.
Here's some data to replicate what I have:
id1 <- c(1,2,2,4,6,7,9,11)
id2 <- c(2,3,4,5,7,8,10,2)
df <- cbind(id1,id2)
df <- as.data.frame(df)
df
id1 id2
1 1 2
2 2 3
3 2 4
4 4 5
5 6 7
6 7 8
7 9 10
8 11 2
And here's what I hope to achieve:
#wanted result
index <- c(1,1,1,1,2,2,3,1)
df_indexed <- cbind(df,index)
df_indexed
id1 id2 index
1 1 2 1
2 2 3 1
3 2 4 1
4 4 5 1
5 6 7 2
6 7 8 2
7 9 10 3
8 11 2 1
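It may be easier to do in igraph, treating each pair as an edge and taking the connected components:
library(igraph)
g <- graph_from_data_frame(df)   # graph.data.frame() in older igraph versions
df$index <- components(g)$membership[as.character(df$id1)]   # clusters() in older versions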
df$index
#[1] 1 1 1 1 2 2 3 1
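If igraph is not available, the same grouping can be sketched in base R by repeatedly propagating the smallest row label through shared ids until nothing changes. This is only an illustrative sketch under the assumption that the data is small enough for a simple loop; the names ids, label, min_by_id and new_label are made up for the example.

```r
id1 <- c(1, 2, 2, 4, 6, 7, 9, 11)
id2 <- c(2, 3, 4, 5, 7, 8, 10, 2)
df <- data.frame(id1, id2)

ids   <- as.character(c(df$id1, df$id2))  # every id, both columns stacked
label <- seq_len(nrow(df))                # start: each row is its own group

repeat {
  # smallest label currently attached to each id value
  min_by_id <- tapply(c(label, label), ids, min)
  # a row takes the smaller of the labels reachable through its two ids
  new_label <- as.vector(pmin(min_by_id[as.character(df$id1)],
                              min_by_id[as.character(df$id2)]))
  if (all(new_label == label)) break      # nothing changed: groups are stable
  label <- new_label
}

# renumber the surviving labels 1, 2, 3, ... in order of first appearance
df$index <- match(label, unique(label))
df$index
# [1] 1 1 1 1 2 2 3 1
```

For a large dataset the igraph version is likely to be much faster, since this loop may need many passes over long chains of pairs.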
I have a long format dataframe with multiple subjects and multiple conditions for each subject.
I want to remove the first row of each condition (except the first one) for all subjects.
My dataframe looks like this:
> df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)), cond = (rep(c("A", "A", "B", "B"),times=3)), value = round(runif(12, min = 0, max = 10)))
> df
subj cond value
1 A 1
1 A 5
1 B 3
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
I have found the duplicated() function but it only removes the first row of each condition for the first subject:
df <- df[duplicated(df$cond),]
subj cond value
1 A 5
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
Is there a way to "reset" the finding of a duplicate whenever a new subject begins?
And how can I stop it from excluding the first row of the first condition?
Thank you all so much!
You could subset using duplicated() on the interaction of the two variables:
> df
subj cond value
1 1 A 5
2 1 A 7
3 1 B 4
4 1 B 8
5 2 A 5
6 2 A 2
7 2 B 8
8 2 B 5
9 3 A 8
10 3 A 1
11 3 B 1
12 3 B 5
df1 <- df[!duplicated(interaction(df$subj, df$cond)),]
> df1
subj cond value
1 1 A 5
3 1 B 4
5 2 A 5
7 2 B 8
9 3 A 8
11 3 B 1
Edit:
I've read your question again and it seems you want to remove the first row, not the last. In this case, use
df1 <- df[!duplicated(interaction(df$subj, df$cond), fromLast = TRUE),]
> df1
subj cond value
2 1 A 7
4 1 B 8
6 2 A 2
8 2 B 5
10 3 A 1
12 3 B 5
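Alternative (this relies on the row order of the actual df, since it only compares each row with the one before it):
df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)),
                 cond = (rep(c("A", "A", "B", "B"),times=3)),
                 value = round(runif(12, min = 0, max = 10)))
df
dummy <- as.character(df$cond) # factor to character
mask <- c(FALSE, dummy[-1] == dummy[-length(dummy)]) # TRUE where cond equals the previous row's cond
df[mask,] # drops the first row of every run of identical cond values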
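The question can also be read literally: within each subject, drop the first row of every condition except the subject's first condition. A minimal base R sketch under that assumed reading follows. The values are copied from the data frame printed in the question, and first_row, first_cond and res are illustrative names.

```r
df <- data.frame(subj  = rep(1:3, each = 4),
                 cond  = rep(c("A", "A", "B", "B"), times = 3),
                 value = c(1, 5, 3, 10, 6, 5, 2, 0, 5, 8, 5, 2),
                 stringsAsFactors = FALSE)

# TRUE on the first row of each (subj, cond) combination
first_row <- !duplicated(df[c("subj", "cond")])

# TRUE on rows whose cond is the first condition seen for that subject
first_cond <- df$cond == ave(df$cond, df$subj,
                             FUN = function(z) rep(z[1], length(z)))

# keep a row unless it is the first row of a non-first condition
res <- df[!first_row | first_cond, ]
res
```

With this data, rows 3, 7 and 11 (the first "B" row of each subject) are removed and all "A" rows are kept.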
How do I extract specific rows when a column contains a repeating sequence of values? I want to extract the row at the end of each run of X (A 3 10, A 2 3, etc.), or the index of that last value. My data looks like this:
Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3
Expected output
Index Name X M
3 A 3 10
5 A 2 3
10 A 5 3
13 B 3 10
15 B 2 3
Using base R duplicated and cumsum:
dups <- !duplicated(cumsum(dat$X == 1), fromLast=TRUE)
cbind(dat[dups,], Index=which(dups))
# Name X M Index
#3 A 3 10 3
#5 A 2 3 5
#10 A 5 3 10
#13 B 3 10 13
#15 B 2 3 15
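To see why this works, it can help to look at the intermediate run id. The sketch below rebuilds the sample data as dat (the name used in the answer above) and assumes, as that answer does, that every run starts with X == 1. run_id and last_in_run are illustrative names.

```r
dat <- data.frame(Name = rep(c("A", "B"), times = c(10, 5)),
                  X = c(1,2,3, 1,2, 1,2,3,4,5, 1,2,3, 1,2),
                  M = c(1,9,10, 1,3, 5,6,4,5,3, 1,9,10, 1,3),
                  stringsAsFactors = FALSE)

# cumsum() over the "run starts here" indicator gives one id per run
run_id <- cumsum(dat$X == 1)
run_id
# [1] 1 1 1 2 2 3 3 3 3 3 4 4 4 5 5

# scanning from the end, the first occurrence of each id is the run's last row
last_in_run <- !duplicated(run_id, fromLast = TRUE)
which(last_in_run)
# [1]  3  5 10 13 15
```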
A solution using dplyr.
library(dplyr)
df2 <- df %>%
  mutate(Flag = ifelse(lead(X) < X, 1, 0)) %>%   # 1 if the next X is smaller
  mutate(Index = 1:n()) %>%
  filter(Flag == 1 | is.na(Flag)) %>%
  select(Index, Name, X, M)
df2
#   Index Name X  M
# 1     3    A 3 10
# 2     5    A 2  3
# 3    10    A 5  3
# 4    13    B 3 10
# 5    15    B 2  3
Flag is a column showing whether the next number in X is smaller than the current one. If it is, Flag is 1, otherwise 0. We can then filter for Flag == 1 or where Flag is NA, which is the last row. df2 is the final filtered data frame.
DATA
df <- read.table(text = "Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3",
header = TRUE, stringsAsFactors = FALSE)
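The same "next value is smaller" test can be sketched in base R with diff(), in case dplyr is not loaded. This is only an illustration: it assumes X strictly decreases whenever a new run starts, and end_of_run and res are made-up names.

```r
df <- data.frame(Name = rep(c("A", "B"), times = c(10, 5)),
                 X = c(1,2,3, 1,2, 1,2,3,4,5, 1,2,3, 1,2),
                 M = c(1,9,10, 1,3, 5,6,4,5,3, 1,9,10, 1,3),
                 stringsAsFactors = FALSE)

# diff(X) < 0 marks rows followed by a smaller X; the last row always ends a run
end_of_run <- c(diff(df$X) < 0, TRUE)

res <- cbind(Index = which(end_of_run), df[end_of_run, ])
res
```

If a run could start at a value that is not smaller than the previous row's X, this test would miss the boundary, and the cumsum(X == 1) approach would be the safer choice.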
How can I get the expected calculation using dplyr package?
row value group expected
1 2 1 =NA
2 4 1 =4-2
3 5 1 =5-4
4 6 2 =NA
5 11 2 =11-6
6 12 1 =NA
7 15 1 =15-12
I tried
df=read.table(header=1, text=' row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1')
df %>% group_by(group) %>% mutate(expected=value-lag(value))
How can I calculate for each chunk (row 1-3, 4-5, 6-7) although row 1-3 and 6-7 are labelled as the same group number?
Here is a similar approach. I created a new grouping variable using cumsum(): whenever the difference between two consecutive values of group is not 0, the counter increases and the row gets a new group number. If you have more data, this approach may be helpful.
library(dplyr)
mutate(df, foo = cumsum(c(TRUE, diff(group) != 0))) %>%
group_by(foo) %>%
mutate(out = value - lag(value))
# row value group foo out
#1 1 2 1 1 NA
#2 2 4 1 1 2
#3 3 5 1 1 1
#4 4 6 2 2 NA
#5 5 11 2 2 5
#6 6 12 1 3 NA
#7 7 15 1 3 3
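The run-numbering step can also be combined with ave() if you prefer to stay in base R. The following is a small sketch of that variant, not the dplyr answer itself; run_id is an illustrative name.

```r
df <- read.table(header = TRUE, text = "row value group
1 2 1
2 4 1
3 5 1
4 6 2
5 11 2
6 12 1
7 15 1")

# a new run starts on the first row and wherever group changes
run_id <- cumsum(c(TRUE, diff(df$group) != 0))

# within each run: NA for the first row, then the difference to the previous value
df$expected <- ave(df$value, run_id, FUN = function(v) c(NA, diff(v)))
df$expected
# [1] NA  2  1 NA  5 NA  3
```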
As your group variable is not useful for this, create a new variable aux and use it as the grouping variable:
library(dplyr)
df$aux <- rep(seq_along(rle(df$group)$values), times = rle(df$group)$lengths)
df %>% group_by(aux) %>% mutate(expected = value - lag(value))
Source: local data frame [7 x 5]
Groups: aux
row value group aux expected
1 1 2 1 1 NA
2 2 4 1 1 2
3 3 5 1 1 1
4 4 6 2 2 NA
5 5 11 2 2 5
6 6 12 1 3 NA
7 7 15 1 3 3
Here is an option using data.table. The functions rleid and shift were introduced in the development version 1.9.5 and are available in the CRAN release from 1.9.6 onward. For shift, the default type is "lag" and the default fill is NA, which is what we need here.
library(data.table)
setDT(df)[, expected:=value-shift(value) ,by = rleid(group)][]
# row value group expected
#1: 1 2 1 NA
#2: 2 4 1 2
#3: 3 5 1 1
#4: 4 6 2 NA
#5: 5 11 2 5
#6: 6 12 1 NA
#7: 7 15 1 3