R - Select rows where at least X columns match a condition

I am trying to select those rows where at least 4 of the columns have the same value. So far, I have tried the apply function, and I can get the rows where any or every column matches the condition.
team.composition[apply(team.composition, 1, function(X) any(as.numeric(X) == 1)),]
This is an example of my table
member.1 member.2 member.3 member.4 member.5
1 3 8 5 3
2 3 2 2 2
7 4 8 8 3
1 8 8 8 8
What I would like is to return the second row (2,3,2,2,2) and the fourth row (1,8,8,8,8).
Any suggestions? Thanks

Try
df1[apply(df1, 1,function(x) any(table(x)>=4)),]
Or
library(reshape2)
df1[!!rowSums(table(melt(as.matrix(df1))[-2])>=4),]
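For reference, here is a minimal sketch (not from the original answers) that rebuilds the example table from the question and applies the first approach to it; the table(x) call counts how often each value occurs within a row.
team.composition <- data.frame(
  member.1 = c(1, 2, 7, 1),
  member.2 = c(3, 3, 4, 8),
  member.3 = c(8, 2, 8, 8),
  member.4 = c(5, 2, 8, 8),
  member.5 = c(3, 2, 3, 8)
)
# keep rows where some value occurs in at least 4 of the 5 columns
team.composition[apply(team.composition, 1, function(x) any(table(x) >= 4)), ]
#   member.1 member.2 member.3 member.4 member.5
# 2        2        3        2        2        2
# 4        1        8        8        8        8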

Related

For each row, return the column indices for a specific number

Hi, suppose I have a matrix containing only 0s and 1s, and I want to find out where the 1s are located in each row. Each row can contain multiple 1s.
For example I have
set.seed(444)
m3 <- matrix(round(runif(8*8)), 8,8)
For example, in the first row, columns 2, 3, and 8 are 1, and I want code that reports either the column names or the column indices. It is also worth pointing out that the number of 1s in each row can differ.
Can anyone provide some suggestions? I appreciate it so much.
We can use which with arr.ind = TRUE, which returns the row/column indices as a matrix:
out <- which(m3 ==1, arr.ind = TRUE)
out[,2][order(out[,1])]
[1] 2 3 8 3 5 3 4 8 7 4 6 7 1 3 4 6 1 4 5 6 7 2 4 7 8
To get the column names, use the same index (provided the matrix has column names; here there is no column names attribute):
colnames(m3)[out[,2][order(out[,1])]]
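If a per-row list of positions is more convenient, one possible sketch (building on the same objects as above) is to split the column indices by row, or to apply which row by row; note that apply simplifies the result to a matrix if every row happens to contain the same number of 1s.
# group the column indices in `out` by their row index
split(out[, 2], out[, 1])
# or compute the per-row indices directly from the matrix
apply(m3 == 1, 1, which)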

Sequence value in data frame column

I need some help writing R code.
I need to check whether a specific column in a data frame is correctly ordered in ascending sequence.
e.g.
df$id | df$order | df$any
3 1 a
4 2 a
7 3 b
1 4 b
2 6 a
9 5 a # select this row - out of sequence in df$order
8 7 a
I would like to select the rows that don't follow the ascending sequence. In the example above, that would be the row with df$id equal to 9, because in df$order the value 5 is found after the value 6.
Obs. 1: in df$order, the numbers range from 1 to N, where N is a number greater than 1.
Obs. 2: If possible, I would like to use core libraries to solve the problem.
Any questions, just ask in the comments.
Thanks in advance!
Using base R:
subset(df,c(0,diff(order))<0)
id order any
6 9 5 a
subset(df,c(0,diff(order))>=0)
id order any
1 3 1 a
2 4 2 a
3 7 3 b
4 1 4 b
5 2 6 a
7 8 7 a
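For completeness, a small sketch (assuming the data frame is built as below, following the question) showing the diff-based filter in action:
df <- data.frame(
  id    = c(3, 4, 7, 1, 2, 9, 8),
  order = c(1, 2, 3, 4, 6, 5, 7),
  any   = c("a", "a", "b", "b", "a", "a", "a")
)
# diff(order) is negative exactly where a value is smaller than its predecessor;
# the leading 0 keeps the logical vector the same length as the data frame
subset(df, c(0, diff(order)) < 0)
#   id order any
# 6  9     5   a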

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD for each numeric column.
I know how to do it for a single column, but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the line of code for keeping only the data that is within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset's columns and subset the numeric vectors (using an if/else condition) based on their mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we needed to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any of the numeric columns, we can loop through the columns with lapply as before; instead of returning 'x', we return the (filtered) sequence along 'x' and then take the intersection of the list elements with Reduce. The resulting numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
    seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD (?)
In that case I would suggest creating two filters: the first indicates which columns are numeric, and the second checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
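As a quick sanity check (a sketch, not part of the original answer): scale() performs the same centring and scaling as the manual formula used for the single birds column, so the two filters agree.
# scale() subtracts the column mean and divides by the column sd
all.equal(as.vector(scale(df$birds)),
          (df$birds - mean(df$birds)) / sd(df$birds))
# [1] TRUE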

Removing rows whose columns contain the same value from a file in R

I have a file like this.
1 3
1 2
1 10
1 5
**5 5**
6 7
8 9
4 6
1 2
**10 10**
......
The file contains thousands of rows. I wanted to know: how can I remove the rows whose columns contain the same value in R (the row containing 5 5 and the row containing 10 10)? I know how to remove duplicate columns or duplicate rows, but how do I go about selectively removing them? Thanks. :)
I would do this with indexing; here is an example with a small data frame:
myDf <- data.frame(a=c(3,5,8,6,9,4,3), b=c(3,3,5,8,9,6,4))
myDf <- myDf[myDf$a != myDf$b,]
I would consider writing a helper function like this:
indicator <- function(indf) {
  rowSums(vapply(indf, function(x) x == indf[, 1],
                 logical(nrow(indf)))) == ncol(indf)
}
Basically, the function compares each column in the data.frame with the first column of the data.frame, then checks which rowSums equal the number of columns in the data.frame.
This basically creates a logical vector that can be used to subset your data.frame.
Example:
mydf <- data.frame(a = c(3,5,8,6,9,4,3),
                   b = c(3,3,5,8,9,6,4),
                   c = c(3,4,5,6,9,7,2))
indicator(mydf)
# [1] TRUE FALSE FALSE FALSE TRUE FALSE FALSE
mydf[!indicator(mydf), ]
# a b c
# 2 5 3 4
# 3 8 5 5
# 4 6 8 6
# 6 4 6 7
# 7 3 4 2
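A simpler, if less general, alternative sketch: drop rows in which every value is identical by counting the distinct values per row (apply coerces each row to a vector, which is fine here because all columns are numeric).
# keep rows that contain more than one distinct value
mydf[apply(mydf, 1, function(r) length(unique(r)) > 1), ]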

R data.table selecting the previous row within group blocks

I have the following example data frame.
id value
a 3
a 4
a 8
b 9
b 8
I want to convert it so that I can calculate differences in the column "value" between successive rows. So the expected result is
id value prevValue
a 3 0
a 4 3
a 8 4
b 9 0
b 8 9
Notice that within each group I want the sequence of values to start with 0 and each successive value to be taken from the row before it. I tried the following:
x = x[,list(
prevValue = c(0,value[1:(.N-1)])
),by=id]
but no luck.
Thanks in advance.
Use negative indexing, something like:
x[, prev.value := c(0, value[-.N]), by = id]
Without data.table:
with(dat,ave(value,id,FUN=function(x) c(0,head(x,-1))))
[1] 0 3 4 0 9
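Putting the pieces together, a minimal sketch with the example data (the data.table is assumed to be named x, and the new column is called prevValue to match the expected output in the question):
library(data.table)

x <- data.table(id = c("a", "a", "a", "b", "b"),
                value = c(3, 4, 8, 9, 8))

# value[-.N] drops the last value of each group, so prefixing a 0
# shifts the group's values down by one row
x[, prevValue := c(0, value[-.N]), by = id]

# the successive differences are then simply
x[, diff := value - prevValue]
x
#    id value prevValue diff
# 1:  a     3         0    3
# 2:  a     4         3    1
# 3:  a     8         4    4
# 4:  b     9         0    9
# 5:  b     8         9   -1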

Resources