How to remove rows based on the column values - r

I have a large data.frame, example:
> m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
> colnames(m) <- c("A", "B", "C", "D", "E")
> rownames(m) <- c("a", "b", "c", "d", "e")
> m
A B C D E
a 3 3 5 7 5
b 6 2 3 5 4
c 2 5 6 8 9
d 5 4 3 2 2
e 3 3 6 5 2
I would like to remove all rows where columns A and/or B have a greater value than columns C D and E.
So in this case rows b, d, e should be removed and I should get this:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I cannot remove them one by one because the data.frame has more than a million rows.
Thanks

Use subsetting together with pmax() and pmin() to keep the rows that you want. I'm not sure that I fully understand your criteria (you said "C D and E", but since you want to throw away row e, I think you meant C, D or E), but the following seems to do what you want:
> m[pmax(m[,"A"],m[,"B"])<=pmin(m[,"C"],m[,"D"],m[,"E"]),]
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
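To see why this works, it can help to look at the intermediate vectors. The sketch below rebuilds the example matrix and splits the one-liner into steps; the names hi_ab, lo_cde, keep and result are only illustrative.

```r
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow = 5, ncol = 5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")

# Row-wise maximum of A and B (pmax compares element by element)
hi_ab <- pmax(m[, "A"], m[, "B"])                 # 3 6 5 5 3
# Row-wise minimum of C, D and E
lo_cde <- pmin(m[, "C"], m[, "D"], m[, "E"])      # 5 3 6 2 2

# TRUE for the rows to keep
keep <- hi_ab <= lo_cde

# drop = FALSE keeps the result a matrix even if only one row survives
result <- m[keep, , drop = FALSE]
```

Both pmax() and pmin() are vectorized, so this should scale to a million rows without an explicit loop.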

# Create the example data
m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow=5, ncol=5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")
# Convert the matrix to a data frame
m <- as.data.frame(m)
df_n <- m
for (i in seq_len(nrow(m))) {
  # Mark the row for removal if the larger of A and B exceeds
  # the smallest of C, D and E
  if (max(m[i, 1:2]) > min(m[i, 3:5])) {
    df_n[i, ] <- NA
  }
}
# Drop the rows that were marked with NA
df_n <- df_n[complete.cases(df_n), ]
print(df_n)
Results
> print(df_n)
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9

Here's another solution with apply:
m[apply(m, 1, function(x) max(x[1], x[2]) <= min(x[3], x[4], x[5])),]
Result:
A B C D E
a 3 3 5 7 5
c 2 5 6 8 9
I think what you actually meant is to remove rows where max(A, B) > min(C, D, E), which translates to keeping rows where every value of A and B is no greater than every value of C, D, and E.
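If the column sets vary, the same rule can be wrapped in a small function. This is only a sketch: keep_rows and its arguments low and high are hypothetical names, and it assumes all the selected columns are numeric. It uses do.call() with pmax() and pmin(), which avoids the row-by-row loop that apply() performs and so is likely faster on a million rows.

```r
# Hypothetical helper: keep rows where every `low` column is <= every `high` column
keep_rows <- function(x, low, high) {
  df <- as.data.frame(x)                       # accept a matrix or a data frame
  # unname() so column names are not mistaken for argument names of pmax/pmin
  row_max <- do.call(pmax, unname(as.list(df[low])))
  row_min <- do.call(pmin, unname(as.list(df[high])))
  df[row_max <= row_min, , drop = FALSE]
}

m <- matrix(c(3,6,2,5,3,3,2,5,4,3,5,3,6,3,6,7,5,8,2,5,5,4,9,2,2), nrow = 5, ncol = 5)
colnames(m) <- c("A", "B", "C", "D", "E")
rownames(m) <- c("a", "b", "c", "d", "e")

res <- keep_rows(m, low = c("A", "B"), high = c("C", "D", "E"))
```

On the example data, res contains rows a and c.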

Related

How do I create a smaller matrix out of a larger dataset? [duplicate]

Is it possible to get a matrix column by name from a matrix?
I tried various approaches such as myMatrix["test", ] but nothing seems to work.
Yes. But place your "test" after the comma if you want the column...
> A <- matrix(sample(1:12, 12, replace = TRUE), ncol=4)
> rownames(A) <- letters[1:3]
> colnames(A) <- letters[11:14]
> A[,"l"]
a b c
6 10 1
See also help(Extract).
> myMatrix <- matrix(1:10, nrow=2)
> rownames(myMatrix) <- c("A", "B")
> colnames(myMatrix) <- c("A", "B", "C", "D", "E")
> myMatrix
A B C D E
A 1 3 5 7 9
B 2 4 6 8 10
> myMatrix["A", "A"]
[1] 1
> myMatrix["A", ]
A B C D E
1 3 5 7 9
> myMatrix[, "A"]
A B
1 2
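A few related details, sketched with the same myMatrix (the names col_vec, col_mat, sub and df are only illustrative): a single column is simplified to a named vector unless drop = FALSE is given, several columns can be selected by name at once, and the $ operator does not work on a matrix, so convert to a data frame first if you need that syntax.

```r
myMatrix <- matrix(1:10, nrow = 2)
rownames(myMatrix) <- c("A", "B")
colnames(myMatrix) <- c("A", "B", "C", "D", "E")

# By default a single column is simplified to a named vector
col_vec <- myMatrix[, "C"]

# drop = FALSE keeps the 2 x 1 matrix shape
col_mat <- myMatrix[, "C", drop = FALSE]

# Several columns can be selected by name at once
sub <- myMatrix[, c("B", "D")]

# `$` is for lists and data frames, so convert the matrix first
df <- as.data.frame(myMatrix)
col_from_df <- df$C
```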

Replace values in named vector with values from another named vector in R

I have two vectors, say:
x <- c(2, 3, 5, 7, 9, 11)
names(x) <- c("a", "b", "c", "d", "e", "f")
y <- c(33,44,55)
names(y) <- c("b", "d", "f")
so that x is
a b c d e f
2 3 5 7 9 11
and y is
b d f
33 44 55
I want to replace the values in x with values in y that have the same name so that the result would be for the new x:
a b c d e f
2 33 5 44 9 55
I'm sure this has been answered somewhere but I can't find it.
You can use the names of y as a subset on x, then replace with y.
x[names(y)] <- y
x
# a b c d e f
# 2 33 5 44 9 55
Another option is replace(), which basically does the same as above but returns the result and does not change x.
replace(x, names(y), y)
# a b c d e f
# 2 33 5 44 9 55
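One thing to watch for: if y contains a name that x does not have, x[names(y)] <- y appends a new element to x instead of replacing one. A minimal sketch that guards against this, assuming you only want to update existing names (the element "z" and the variable common are made up for illustration):

```r
x <- c(a = 2, b = 3, c = 5, d = 7, e = 9, f = 11)
y <- c(b = 33, d = 44, z = 99)   # "z" has no counterpart in x

# Restrict the update to names that both vectors share
common <- intersect(names(y), names(x))
x[common] <- y[common]
x
# a  b  c  d  e  f
# 2 33  5 44  9 11
```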

Extract n rows after string in R

I would like to extract the next 'n' rows after I find a string in R.
For example, let's say I have the following data frame:
df<-as.data.frame(rep(c("a","b","c","d","e","f"),10))
I would like to extract every row that includes "b", as well as the next two rows (in this example, that means the rows with "b", "c", or "d").
However, I don't want to specify "c" and "d" explicitly; I just want the two rows that follow each "b" (in my real data the next two rows are not consistent).
I've tried many things without success. Thanks in advance! Nick
You can find the indices of rows with b and then use those and the next two of each, something like this:
df <- data.frame(col1=rep(c("a","b","c","d","e","f"),3), col2=letters[1:18], stringsAsFactors = FALSE)
df
col1 col2
1 a a
2 b b
3 c c
4 d d
5 e e
6 f f
7 a g
8 b h
9 c i
10 d j
11 e k
12 f l
13 a m
14 b n
15 c o
16 d p
17 e q
18 f r
bs <- which(df$col1=="b")
df[sort(bs + rep(0:2, each=length(bs))), ] # 2 is the number of rows you want after each match (b).
col1 col2
2 b b
3 c c
4 d d
8 b h
9 c i
10 d j
14 b n
15 c o
16 d p
I added a second column to illustrate the dataframe better, otherwise a vector would be returned.
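If matches can sit close together or near the end of the data, the computed positions may repeat or run past the last row. Here is a hedged variant of the same idea that handles both cases; the sample data and the names n, idx and res are made up for illustration.

```r
df <- data.frame(col1 = c("a", "b", "b", "c", "d", "b"),
                 col2 = letters[1:6],
                 stringsAsFactors = FALSE)

n <- 2                            # number of rows to take after each match
bs <- which(df$col1 == "b")       # positions 2, 3 and 6

# outer() builds every match position plus each offset 0..n
idx <- as.vector(outer(bs, 0:n, "+"))
# Drop positions past the last row and repeats, then restore row order
idx <- sort(unique(idx[idx <= nrow(df)]))

res <- df[idx, ]
```

Here res holds rows 2 to 6, each exactly once, even though the windows for rows 2 and 3 overlap and the window for row 6 would otherwise run off the end.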
My "SOfun" package has a function called getMyRows which does what you ask for, with the exception of returning a list instead of a data.frame.
I had left the result as a list to make it easier to handle some edge cases, like where the requests for rows would overlap. For example, in the following sample data, there are two consecutive "b" values. There's also a "b" value in the final row.
df <- data.frame(col1 = c("a", "b", "b",
rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
col2 = letters[1:22])
library(SOfun)
getMyRows(df, which(df$col1 == "b"), 0:2, TRUE)
# [[1]]
# col1 col2
# 2 b b
# 3 b c
# 4 a d
#
# [[2]]
# col1 col2
# 3 b c
# 4 a d
# 5 b e
#
# [[3]]
# col1 col2
# 5 b e
# 6 c f
# 7 d g
#
# [[4]]
# col1 col2
# 11 b k
# 12 c l
# 13 d m
#
# [[5]]
# col1 col2
# 17 b q
# 18 c r
# 19 d s
#
# [[6]]
# col1 col2
# 22 b v
The usage is essentially:
Specify the data.frame.
Specify the index positions to use as the base. Here, we want all rows where "col1" equals "b" to be our base index position.
Specify the range of rows you are interested in. -1:3, for example, would give you one row before to three rows after the base.
TRUE means that you are specifying the starting points by their numeric indices.
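If you would rather not install a package, a similar list result can be sketched in base R. To be clear, rows_around below is a hypothetical stand-in written for this example, not the actual implementation of getMyRows, and it only covers the case where the base positions are numeric indices.

```r
# Hypothetical helper: one data.frame per base position, clipped to valid rows
rows_around <- function(data, base, range) {
  lapply(base, function(i) {
    idx <- i + range
    idx <- idx[idx >= 1 & idx <= nrow(data)]   # stay inside the data frame
    data[idx, , drop = FALSE]
  })
}

df <- data.frame(col1 = c("a", "b", "b",
                          rep(c("a", "b", "c", "d", "e", "f"), 3), "b"),
                 col2 = letters[1:22])

out <- rows_around(df, which(df$col1 == "b"), 0:2)
```

With the sample data above, out has six elements; the first holds rows 2 to 4 and the last holds only row 22, because the window is clipped at the end of the data.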
