r - remove first row of condition per subject in dataframe - r

I have a long format dataframe with multiple subjects and multiple conditions for each subject.
I want to remove the first row of each condition (except the first one) for all subjects.
My dataframe looks like this:
> df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)), cond = (rep(c("A", "A", "B", "B"),times=3)), value = round(runif(12, min = 0, max = 10)))
> df
subj cond value
1 A 1
1 A 5
1 B 3
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
I have found the duplicated() function but it only removes the first row of each condition for the first subject:
df <- df[duplicated(df$cond),]
subj cond value
1 A 5
1 B 10
2 A 6
2 A 5
2 B 2
2 B 0
3 A 5
3 A 8
3 B 5
3 B 2
Is there a way to "reset" the finding of a duplicate whenever a new subject begins?
And how can I stop it from excluding the first row of the first condition?
Thank you all so much!

You could subset with the duplicated interaction of the two variables:
> df
subj cond value
1 1 A 5
2 1 A 7
3 1 B 4
4 1 B 8
5 2 A 5
6 2 A 2
7 2 B 8
8 2 B 5
9 3 A 8
10 3 A 1
11 3 B 1
12 3 B 5
df1 <- df[!duplicated(interaction(df$subj, df$cond)),]
> df1
subj cond value
1 1 A 5
3 1 B 4
5 2 A 5
7 2 B 8
9 3 A 8
11 3 B 1
Edit:
I've read your question again and it seems you want to remove the first row, not the last. In this case, use
df1 <- df[!duplicated(interaction(df$subj, df$cond), fromLast = TRUE),]
> df1
subj cond value
2 1 A 4
4 1 B 9
6 2 A 9
8 2 B 7
10 3 A 1
12 3 B 2

Alternative (but does depend on actual df):
df <- data.frame(subj = c(rep(1,4),rep(2,4), rep(3,4)),
cond = (rep(c("A", "A", "B", "B"),times=3)),
value = round(runif(12, min = 0, max = 10)))
df
dummy <- as.character(df$cond) # factor to character
mask <- c(FALSE, dummy[-1] == dummy[-length(dummy)])
df[mask,]

Related

How to assign a value to a column based on a column index

Having a data frame I would like to assign a calculated value based on a given a column index
df <- data.frame(a = c(2,4,7,3,5,3), b = c(8,3,8,2,6,1))
> df
a b
1 2 8
2 4 3
3 7 8
4 3 2
5 5 6
6 3 1
max <- apply(df, 1, which.max)
> max
[1] 2 1 2 1 2 1
addition <- apply(df, 1, sum)
> addition
[1] 10 7 15 5 11 4
Then some operation which I cannot figure out with the following result being assigned to df2
> df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
highly appreciate your ideas and your help. Thank you
You can use cbind to access your selected columns for each row:
df2 = df
df2[cbind(1:nrow(df2),max)] = addition
df2
a b
1 2 10
2 7 3
3 7 15
4 5 2
5 5 11
6 4 1
Here, cbind returns a matrix of 2 columns and 6 rows that we use to subset the dataframe using matrix subsetting.
You can also use vectorised ifelse directly:
with(df, cbind.data.frame(a = ifelse(a > b, a + b, a), b = ifelse(a > b, b, a + b)));
# a b
#1 2 10
#2 7 3
#3 7 15
#4 5 2
#5 5 11
#6 4 1

Extract Index of repeat value

how do I extract specific row of data when the column has repetitive value? my data looks like this: I want to extract the row of the end of each repeat of x (A 3 10, A 2 3 etc) or the index of the last value
Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3
Expected output
Index Name X M
3 A 3 10
5 A 2 3
10 A 5 3
13 B 3 10
15 B 2 3
Using base R duplicated and cumsum:
dups <- !duplicated(cumsum(dat$X == 1), fromLast=TRUE)
cbind(dat[dups,], Index=which(dups))
# Name X M Index
#3 A 3 10 3
#5 A 2 3 5
#10 A 5 3 10
#13 B 3 10 13
#15 B 2 3 15
A solution using dplyr.
library(dplyr)
df2 <- df %>%
mutate(Flag = ifelse(lead(X) < X, 1, 0)) %>%
mutate(Index = 1:n()) %>%
filter(Flag == 1 | is.na(Flag)) %>%
select(Index, X, M)
df2
# Index X M
# 1 3 3 10
# 2 5 2 3
# 3 10 5 3
# 4 13 3 10
# 5 15 2 3
Flag is a column showing if the next number in A is smaller than the previous number. If TRUE, Flag is 1, otherwise is 0. We can then filter for Flag == 1 or where Flag is NA, which is the last row. df2 is the final filtered data frame.
DATA
df <- read.table(text = "Name X M
A 1 1
A 2 9
A 3 10
A 1 1
A 2 3
A 1 5
A 2 6
A 3 4
A 4 5
A 5 3
B 1 1
B 2 9
B 3 10
B 1 1
B 2 3",
header = TRUE, stringsAsFactors = FALSE)

Remove semi duplicate rows in R

I have the following data.frame.
a <- c(rep("A", 3), rep("B", 3), rep("C",2), "D")
b <- c(NA,1,2,4,1,NA,2,NA,NA)
c <- c(1,1,2,4,1,1,2,2,2)
d <- c(1,2,3,4,5,6,7,8,9)
df <-data.frame(a,b,c,d)
a b c d
1 A NA 1 1
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
6 B NA 1 6
7 C 2 2 7
8 C NA 2 8
9 D NA 2 9
I want to remove duplicate rows (based on column A & C) so that the row with values in column B are kept. In this example, rows 1, 6, and 8 are removed.
One way to do this is to order by 'a', 'b' and the the logical vector based on 'b' so that all 'NA' elements will be last for each group of 'a', and 'b'. Then, apply the duplicated and keep only the non-duplicate elements
df1 <- df[order(df$a, df$b, is.na(df$b)),]
df2 <- df1[!duplicated(df1[c('a', 'c')]),]
df2
# a b c d
#2 A 1 1 2
#3 A 2 2 3
#5 B 1 1 5
#4 B 4 4 4
#7 C 2 2 7
#9 D NA 2 9
setdiff(seq_len(nrow(df)), row.names(df2) )
#[1] 1 6 8
First create two datasets, one with duplicates in column a and one without duplicate in column a using the below function :
x = df[df$a %in% names(which(table(df$a) > 1)), ]
x1 = df[df$a %in% names(which(table(df$a) ==1)), ]
Now use na.omit function on data set x to delete the rows with NA and then rbind x and x1 to the final data set.
rbind(na.omit(x),x1)
Answer:
a b c d
2 A 1 1 2
3 A 2 2 3
4 B 4 4 4
5 B 1 1 5
7 C 2 2 7
9 D NA 2 9
You can use dplyr to do this.
df %>% distinct(a, c, .keep_all = TRUE)
Output
a b c d
1 A NA 1 1
2 A 2 2 3
3 B 4 4 4
4 B 1 1 5
5 C 2 2 7
6 D NA 2 9
There are other options in dplyr, check this question for details: Remove duplicated rows using dplyr

Creating new dataframe with missing value

i have a dataframe structured like this
time <- c(1,1,1,1,2,2)
group <- c('a','b','c','d','c','d')
number <- c(2,3,4,1,2,12)
df <- data.frame(time,group,number)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 c 2
6 2 d 12
in order to plot the data i need it to contain the values for each group (from a-d) at each time interval, even if they equal zero. so a data frame looking like this:
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a 0
6 2 b 0
7 2 c 2
8 2 d 12
any help?
You can use expand.grid and merge, like this:
> merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all = TRUE)
time group number
1 1 a 2
2 1 b 3
3 1 c 4
4 1 d 1
5 2 a NA
6 2 b NA
7 2 c 2
8 2 d 12
From there, it's just a simple matter of replacing NA with 0.
new <- merge(df, expand.grid(lapply(df[c(1, 2)], unique)), all.y = TRUE)
new[is.na(new$number),"number"] <- 0
new

Combining values from two different data frames

I have two data frames like this:
df.1 <- data.frame(
var.1 = sample(1:10),
code = sample(c("A", "B", "C"), 10, replace = TRUE))
df.2 <- data.frame(
var.2 = sample(1:3),
row.names=c("A","B","C"))
What I need to do is to add a third column df.1$var.2 which, for each value in df.1$code take the value from df.2$var.2 accordingly to their row name.
I got to this point but with no success.. Suggestions?
for (i in 1:length(df.1$code)){
if(df.1$code[i] == rownames(df.2))
df.1$var.2[i] <- df.2$var.2
}
You mean like this:
df.2$code <- rownames(df.2)
> merge(df.1,df.2,by = "code")
code var.1 var.2
1 A 5 1
2 B 3 2
3 B 2 2
4 B 7 2
5 B 10 2
6 C 8 3
7 C 4 3
8 C 1 3
9 C 9 3
10 C 6 3
Or, join() from the plyr package to preserve the order of df.1
df.2$code <- rownames(df.2)
library(plyr)
join(df.1, df.2, by = "code")
var.1 code var.2
1 7 B 2
2 2 A 1
3 3 C 3
4 6 B 2
5 10 C 3
6 4 C 3
7 1 C 3
8 8 B 2
9 9 A 1
10 5 C 3

Resources