Removing Only Adjacent Duplicates in Data Frame in R - r

I have a data frame in R that is supposed to have duplicates. However, there are some duplicates that I would need to remove. In particular, I only want to remove row-adjacent duplicates, but keep the rest. For example, suppose I had the data frame:
df = data.frame(x = c("A", "B", "C", "A", "B", "C", "A", "B", "B", "C"),
y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10))
This results in the following data frame
x y
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 8
B 9
C 10
In this case, I expect there to be repeating "A, B, C, A, B, C, etc.". However, it is only a problem if I see adjacent row duplicates. In my example above, that would be rows 8 and 9 with the duplicate "B" being adjacent to each other.
In my data set, whenever this occurs, the first instance is always a user-error, and the second is always the correct version. In very rare cases, there might be an instance where the duplicates occur 3 (or more) times. However, in every case, I would always want to keep the last occurrence. Thus, following the example from above, I would like the final data set to look like
A 1
B 2
C 3
A 4
B 5
C 6
A 7
B 9
C 10
Is there an easy way to do this in R? Thank you in advance for your help!
Edit: 11/19/2014 12:14 PM EST
There was a solution posted by user Akron (spelling?) that has since been deleted. I am not sure why, because it seemed to work for me.
The solution was
df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
It seems to work for me, why did it get deleted? For example, in cases with more than 2 consecutive duplicates:
df = data.frame(x = c("A", "B", "B", "B", "C", "C", "C", "A", "B", "C", "A", "B", "B", "C"), y = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14))
x y
1 A 1
2 B 2
3 B 3
4 B 4
5 C 5
6 C 6
7 C 7
8 A 8
9 B 9
10 C 10
11 A 11
12 B 12
13 B 13
14 C 14
> df = df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
> df
x y
1 A 1
4 B 4
7 C 7
8 A 8
9 B 9
10 C 10
11 A 11
13 B 13
14 C 14
This seems to work?

Try
df[with(df, c(x[-1]!= x[-nrow(df)], TRUE)),]
# x y
#1 A 1
#2 B 2
#3 C 3
#4 A 4
#5 B 5
#6 C 6
#7 A 7
#9 B 9
#10 C 10
Explanation
Here, we compare each element with its neighbor. The column with its first element removed is compared with the same column with its last element removed (so that the lengths are equal), which means position i of the result says whether row i differs from row i + 1.
df$x[-1] #first element removed
#[1] B C A B C A B B C
df$x[-nrow(df)]
#[1] A B C A B C A B B #last element `C` removed
df$x[-1]!=df$x[-nrow(df)]
#[1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
In the above, the length is 1 less than the nrow of df because we removed one element. To compensate, we append a TRUE (the last row always ends its run, so it is always kept) and then use this logical vector to subset the dataset.
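As a minimal sketch of the same idea wrapped in a function, the snippet below keeps the last row of every run of adjacent duplicates. The helper name keep_last_of_runs and its col argument are made up for illustration and are not part of the original answer. It assumes the column has no NA values, because a comparison against NA yields NA and would produce NA rows when subsetting.

```r
# Hypothetical helper (name and arguments are illustrative only):
# keep the last row of each run of adjacent duplicates in column `col`.
keep_last_of_runs <- function(df, col) {
  x <- as.character(df[[col]])
  n <- length(x)
  if (n < 2) return(df)            # nothing to compare with 0 or 1 rows
  # keep[i] is TRUE when row i differs from row i + 1; the last row is always kept
  keep <- c(x[-1] != x[-n], TRUE)
  df[keep, , drop = FALSE]
}

# A run of three "B" rows: only the last one (y = 4) should survive
df <- data.frame(x = c("A", "B", "B", "B", "C", "A"), y = 1:6)
res <- keep_last_of_runs(df, "x")
res
```

The original row names are preserved, so the gaps in them show which rows were dropped.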

Here's an rle solution:
df[cumsum(rle(as.character(df$x))$lengths), ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10
Explanation:
RLE stands for run-length encoding. rle() returns a list with two vectors: values, holding the value of each run, and lengths, holding the number of consecutive repeats of each value. For example, x <- c(3, 2, 2, 3) has values c(3, 2, 3) and lengths c(1, 2, 1). The cumulative sum of the lengths is c(1, 3, 4), which is the position of the last element of each run. Subset x with this vector and you get c(3, 2, 3). Note that the second element of the cumulative sum, 3, is the index of the third element of x, which is the last occurrence of 2 in that particular run.
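The small example from the explanation can be run directly. This sketch only uses base R; the variable names runs, last_idx, and first_idx are chosen here for illustration.

```r
x <- c(3, 2, 2, 3)
runs <- rle(x)

runs$values    # value of each run
runs$lengths   # number of consecutive repeats in each run

# Position of the LAST element of each run
last_idx <- cumsum(runs$lengths)
x[last_idx]

# For comparison: position of the FIRST element of each run
first_idx <- last_idx - runs$lengths + 1
x[first_idx]
```

Using last_idx keeps the last occurrence in each run, which is what the question asks for; first_idx would keep the first occurrence instead.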

You could also try
df[c(diff(as.numeric(df$x)), 1) != 0, ]
In case x is of character class (rather than factor), try
df[c(diff(as.numeric(factor(df$x))), 1) != 0, ]
# x y
# 1 A 1
# 2 B 2
# 3 C 3
# 4 A 4
# 5 B 5
# 6 C 6
# 7 A 7
# 9 B 9
# 10 C 10

Related

Add column inside named matrix

Suppose I have the following matrix:
m <- matrix(1:12, nrow = 3, dimnames = list(c("a", "b", "c"), c("w", "x", "y", "z")))
# w x y z
# a 1 4 7 10
# b 2 5 8 11
# c 3 6 9 12
How can I add a column with the values c(13, 14, 15) between column x and y without knowing where x and y are?
Using number ranges I know how to do this using cbind.
cbind(m[,1:2], c(13, 14, 15), m[,3:4])
# w x y z
# a 1 4 13 7 10
# b 2 5 14 8 11
# c 3 6 15 9 12
For named columns, it'd be neat if I could supply the column ranges with m[,:"x"] and m[,"y":] of some sort, but unfortunately that doesn't work.
Additionally, if possible, giving that column its own header name during the insertion process would be nice.
EDIT: I should have specified that x and y always are in order, so adding the column after x would have been enough. Thanks for the more general answers as well!
If you cannot assume that x comes before y, and the two columns do not need to be directly next to each other, you can try:
i <- seq_len(min(match(c("x", "y"), colnames(m))))
cbind(m[,i], v=c(13, 14, 15), m[,-i])
# w x v y z
#a 1 4 13 7 10
#b 2 5 14 8 11
#c 3 6 15 9 12
If they are known to be in order, it is enough to put the new column after x:
i <- seq_len(match("x", colnames(m)))
cbind(m[,i], v=c(13, 14, 15), m[,-i])
You can find the column positions by name and insert the new column between them:
x_pos <- which(colnames(m) == "x")
y_pos <- which(colnames(m) == "y")
m <- cbind(m[,1:x_pos], new=c(13, 14, 15), m[,y_pos:ncol(m)])
You can use which to find the desired column and assign a name in cbind, i.e.
cbind(m[, seq(which(colnames(m) == 'x'))],
w = c(13, 14, 15),
m[, (which(colnames(m) == 'y'):ncol(m))])
# w x w y z
#a 1 4 13 7 10
#b 2 5 14 8 11
#c 3 6 15 9 12
It's not exactly pretty but you can do this
cbind(m[,1:(which(dimnames(m)[[2]]=="x"))],
t=c(13, 14, 15),
m[,(which(dimnames(m)[[2]]=="y")):dim(m)[2]])
You can use this function :
insert_a_column <- function(mat, first_col, second_col, new_col, vec) {
  #Get index of first column to match
  one <- match(first_col, colnames(mat))
  #Get index of second column to match
  two <- match(second_col, colnames(mat))
  #Add the middle column and combine the data
  new_mat <- cbind(mat[, 1:one, drop = FALSE], vec,
                   mat[, (one + 1):ncol(mat), drop = FALSE])
  #rename the new column
  colnames(new_mat)[one + 1] <- new_col
  #Return the matrix.
  return(new_mat)
}
insert_a_column(m, "x", "y", "a", c(13, 14, 15))
# w x a y z
#a 1 4 13 7 10
#b 2 5 14 8 11
#c 3 6 15 9 12
insert_a_column(m, "y", "z", "a", c(13, 14, 15))
# w x y a z
#a 1 4 7 13 10
#b 2 5 8 14 11
#c 3 6 9 15 12
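Since the edit to the question says that inserting after x is enough, here is a hedged sketch of that simpler case. It appends the new column at the end and then reorders the columns with append(), which also works when the anchor column is the last one (a case where a range like (one + 1):ncol(mat) would count downwards). The function name insert_col_after and its arguments are hypothetical, not taken from any of the answers.

```r
# Hypothetical helper: insert `vec` as a new column called `name`
# immediately after the column named `after`.
insert_col_after <- function(mat, after, vec, name) {
  pos <- match(after, colnames(mat))
  stopifnot(!is.na(pos), length(vec) == nrow(mat))
  out <- cbind(mat, vec)                 # new column goes last for now
  colnames(out)[ncol(out)] <- name
  # Column order: original columns, with the new one slotted in after `pos`
  ord <- append(seq_len(ncol(mat)), ncol(out), after = pos)
  out[, ord, drop = FALSE]
}

m <- matrix(1:12, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("w", "x", "y", "z")))
res <- insert_col_after(m, "x", c(13, 14, 15), "v")
res_last <- insert_col_after(m, "z", c(13, 14, 15), "v")
res
```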

Randomly sample values from a pool so that the sum is less than a threshold in R

Let's say we have a pool of values and I want to sample a random number of values from this pool so that the sum of these values is between two thresholds. I want to write a function in R that implements this.
pool = data.frame(ID = letters, value = sample(1:5, size = 26, replace = T))
> print(pool)
ID value
1 a 1
2 b 4
3 c 4
4 d 2
5 e 2
6 f 4
7 g 5
8 h 5
9 i 4
10 j 3
11 k 3
12 l 5
13 m 3
14 n 2
15 o 3
16 p 4
17 q 1
18 r 1
19 s 5
20 t 1
21 u 2
22 v 4
23 w 5
24 x 2
25 y 4
26 z 1
I want to randomly sample any number of IDs so that the sum of the values for these IDs is between two thresholds, let's say between 8 and 10 (including the two boundaries). The expected outcome should look like these:
c("a", "b", "c")
c("f", "g")
c("a", "d", "e", "j", "k")
I think this question has not been asked previously. Does anyone have clues?
Here's an approach where I shuffle the input and check the cumulative sum of the shuffled output to look for an acceptable sum.
If a subset of that initial sequence happens to work, it outputs that sequence (in this manifestation, the longest sequence under the max threshold). If it doesn't work, it reshuffles and looks again, up to the max number of iterations.
set.seed(42)
library(dplyr)
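sample_in_range <- function(src_tbl, min_sum = 8, max_sum = 10, max_iter = 100) {
  for(i in 1:max_iter) {
    # shuffle all rows, then keep the longest prefix whose running sum is <= max_sum
    output <- src_tbl %>%
      sample_n(nrow(src_tbl)) %>%
      mutate(ID = as.character(ID),
             cuml = cumsum(value)) %>%
      filter(cuml <= max_sum)
    # accept the prefix if its total also reaches min_sum; otherwise reshuffle
    if(max(output$cuml) >= min_sum) return(output)
  }
  # falls through (returning NULL) if no shuffle succeeded within max_iter
}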
output <- sample_in_range(pool)
output
ID value cuml
1 k 3 3
2 w 2 5
3 z 4 9
4 t 1 10
output %>% pull(ID)
[1] "k" "w" "z" "t"
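The same shuffle-and-check idea can be sketched in base R without dplyr. The function name sample_ids_in_range and its arguments are hypothetical. The sketch assumes all values are positive, so the cumulative sum is increasing and counting the entries that are at most max_sum gives the length of the longest acceptable prefix. It returns NULL explicitly when no shuffle succeeds within max_iter attempts.

```r
# Hypothetical base-R variant of the shuffle-and-check approach.
# Assumes all `values` are positive, so cumsum() is strictly increasing.
sample_ids_in_range <- function(ids, values, min_sum = 8, max_sum = 10,
                                max_iter = 100) {
  for (i in seq_len(max_iter)) {
    ord <- sample.int(length(values))      # random permutation of positions
    cuml <- cumsum(values[ord])
    k <- sum(cuml <= max_sum)              # length of longest prefix within max_sum
    if (k > 0 && cuml[k] >= min_sum) {
      return(ids[ord[seq_len(k)]])
    }
  }
  NULL                                     # no acceptable prefix found
}

set.seed(1)
pool <- data.frame(ID = letters,
                   value = sample(1:5, size = 26, replace = TRUE),
                   stringsAsFactors = FALSE)
picked <- sample_ids_in_range(pool$ID, pool$value)
picked_sum <- sum(pool$value[match(picked, pool$ID)])
```

Note that this, like the dplyr version, always takes the longest prefix under max_sum, so it is not a uniform sample over all valid subsets.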

Correction for multiple testing for very large files with repetitions

I have 10 files with size ~8-9 Gb like:
7 72603 0.0780181622612
15 72603 0.027069072329
20 72603 0.00215643186987
24 72603 0.00247965378216
29 72603 0.0785606184492
32 72603 0.0486866833899
33 72603 0.000123332654879
For each pair of numbers (1st and 2nd column) I have p-value (3rd column).
However, I have repeated pairs (they can be in different files) and I want to get rid of one of them. If the files were smaller, I would use pandas. E.g.:
7 15 0.0012423442
...
15 7 0.0012423442
Also I want to apply to this set a correction for multiple testing, but the vector of values is very large.
Is it possible to do this with Python or R?
> df <- data.frame(V1 = c("A", "A", "B", "B", "C", "C"),
+ V2 = c("B", "C", "A", "C", "A", "B"),
+ n = c(1, 3, 1, 2, 3, 2))
> df
V1 V2 n
1 A B 1
2 A C 3
3 B A 1
4 B C 2
5 C A 3
6 C B 2
> df[!duplicated(t(apply(df, 1, sort))), ]
V1 V2 n
1 A B 1
2 A C 3
4 B C 2
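A hedged sketch of how the two steps could fit together on a small in-memory table: build an order-independent key from the two ID columns only (so the p-value column does not take part in the comparison), drop repeated pairs, and then adjust with p.adjust() from base R's stats package. The data and the column names V1, V2, and p are invented for illustration, and the choice of the Benjamini-Hochberg method is an assumption. This does not address the 8-9 GB file size, which would need chunked or out-of-memory processing that is not shown here.

```r
# Toy data: (7, 15) and (15, 7) are the same unordered pair
df <- data.frame(V1 = c(7, 15, 20, 15),
                 V2 = c(15, 7, 7, 20),
                 p  = c(0.0012, 0.0012, 0.03, 0.2))

# Order-independent key: smaller ID first, larger ID second
key <- data.frame(lo = pmin(df$V1, df$V2),
                  hi = pmax(df$V1, df$V2))

# Keep the first occurrence of each unordered pair
dedup <- df[!duplicated(key), ]

# Multiple-testing correction over the remaining p-values
# (Benjamini-Hochberg is used here purely as an example method)
dedup$p_adj <- p.adjust(dedup$p, method = "BH")
dedup
```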

Filter data.frame based on rowwise NA count

I would like to filter a data.frame based on the number of NA's in each row.
If I start with the following,
> d
A B C E
1 2 2 6 7
2 4 9 NA 10
3 6 NA NA 4
4 9 7 1 8
I would like to filter d to remove rows with 2 or more NA's in columns A, B, and C to yield:
A B C E
1 2 2 6 7
2 4 9 NA 10
4 9 7 1 8
We could use rowSums with is.na on the subset of columns of dataset to subset the rows
d[rowSums(is.na(d[1:3]))<2,]
# A B C E
#1 2 2 6 7
#2 4 9 NA 10
#4 9 7 1 8
d[1:3] selects only the columns A, B, and C. Applying is.na converts it to a logical matrix of TRUE/FALSE values, rowSums counts the TRUE values in each row, and finally we check whether that count is less than 2 to get a logical vector, which we use for subsetting the rows.
An alternative would be to use Reduce with +
d[Reduce(`+`,lapply(d[1:3], is.na)) <2,]
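For a self-contained run, the sketch below rebuilds d from the table in the question (the constructor call is reconstructed from the printed values) and shows the intermediate NA count. Selecting the columns by name rather than by position is a small variation that does not depend on column order.

```r
# Data frame reconstructed from the table printed in the question
d <- data.frame(A = c(2, 4, 6, 9),
                B = c(2, 9, NA, 7),
                C = c(6, NA, NA, 1),
                E = c(7, 10, 4, 8))

# Number of NAs per row, looking only at columns A, B and C
na_count <- rowSums(is.na(d[c("A", "B", "C")]))
na_count

# Keep rows with fewer than 2 NAs in those columns
res <- d[na_count < 2, ]
res
```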
For reproducibility, define a data.frame as below with various numbers of NAs in each row.
df <- data.frame(
A = c(1, 2, 3, NA),
B = c(1, 2, NA, NA),
C = c(1, NA, NA, NA),
E = c(5, 6, 7, 8)
)
Define a function that counts the number of NAs in a given row:
countNA <- function(df) apply(df, MARGIN = 1, FUN = function(x) length(x[is.na(x)]))
Based on the wording of the question, exclude column E from this calculation:
df_noE <- subset(df, select=-E)
Now count NAs in each row using the function above:
na_count <- countNA(df_noE)
Now filter the original data.frame with this count:
df[na_count < 2,]
All together in a single line:
df[countNA(subset(df, select=-E)) < 2,]

cbind warnings : row names were found from a short variable and have been discarded

I have the lines of code below using cbind, but I get a warning message every time.
The code still works as it should, but is there any way to resolve the warning?
dateset = subset(all_data[,c("VAR1","VAR2","VAR3","VAR4","VAR5","RATE1","RATE2","RATE3")])
dateset = cbind(dateset[c(1,2,3,4,5)],stack(dateset[,-c(1,2,3,4,5)]))
Warnings :
Warning message:
In data.frame(..., check.names = FALSE) :
row names were found from a short variable and have been discarded
Thanks in advance!
I'm guessing your data.frame has row.names:
A <- data.frame(a = c("A", "B", "C"),
b = c(1, 2, 3),
c = c(4, 5, 6),
row.names=c("A", "B", "C"))
cbind(A[1], stack(A[-1]))
# a values ind
# 1 A 1 b
# 2 B 2 b
# 3 C 3 b
# 4 A 4 c
# 5 B 5 c
# 6 C 6 c
# Warning message:
# In data.frame(..., check.names = FALSE) :
# row names were found from a short variable and have been discarded
What's happening here is that since you can't by default have duplicated row.names in a data.frame and since you don't tell R at any point to duplicate the row.names when recycling the first column to the same number of rows of the stacked column, R just discards the row.names.
Compare with a similar data.frame, but one without row.names:
B <- data.frame(a = c("A", "B", "C"),
b = c(1, 2, 3),
c = c(4, 5, 6))
cbind(B[1], stack(B[-1]))
# a values ind
# 1 A 1 b
# 2 B 2 b
# 3 C 3 b
# 4 A 4 c
# 5 B 5 c
# 6 C 6 c
Alternatively, you can set row.names = NULL in your cbind statement:
cbind(A[1], stack(A[-1]), row.names = NULL)
# a values ind
# 1 A 1 b
# 2 B 2 b
# 3 C 3 b
# 4 A 4 c
# 5 B 5 c
# 6 C 6 c
If your original row.names are important, you can also add them back in with:
cbind(rn = rownames(A), A[1], stack(A[-1]), row.names = NULL)
# rn a values ind
# 1 A A 1 b
# 2 B B 2 b
# 3 C C 3 b
# 4 A A 4 c
# 5 B B 5 c
# 6 C C 6 c
