R - Subset rows of a data frame on a condition in all the columns

I want to subset the rows of a data frame on a single condition applied to all the columns, avoiding the use of subset.
I understand how to subset on a single column, but I cannot generalize to all columns (without naming each column).
Initial data frame :
V1 V2 V3
1 1 8 15
2 2 0 16
3 3 10 17
4 4 11 18
5 5 0 19
6 0 13 20
7 7 14 21
In this example, I want to subset the rows without zeros.
Expected output :
V1 V2 V3
1 1 8 15
2 3 10 17
3 4 11 18
4 7 14 21
Thanks

# create your data
a <- c(1,2,3,4,5,0,7)
b <- c(8,0,10,11,0,13,14)
c <- c(15,16,17,18,19,20,21)
data <- cbind(a, b, c)
# keep rows whose minimum is positive (note: this assumes all values are
# non-negative; a negative value would also be dropped by this test)
data[apply(data, 1, min) > 0,]

A solution using the rowSums function after comparing the values to 0.
# creating your data
data <- data.frame(a = c(1,2,3,4,5,0,7),
b = c(8,0,10,11,0,14,14),
c = c(15,16,17,18,19,20,21))
# Selecting rows containing no 0.
data[which(rowSums(as.matrix(data)==0) == 0),]
Another way
data[-unique(row(data)[grep("^0$", unlist(data))]),]
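For completeness, here is a sketch of the same row filter written with dplyr (an assumption on our part: this uses if_all(), which requires dplyr >= 1.0.4 to be installed; the data match the question):

```r
library(dplyr)

data <- data.frame(a = c(1, 2, 3, 4, 5, 0, 7),
                   b = c(8, 0, 10, 11, 0, 13, 14),
                   c = c(15, 16, 17, 18, 19, 20, 21))

# keep only the rows where every column is non-zero
res <- data %>% filter(if_all(everything(), ~ .x != 0))
```

This scales to any number of columns without naming them, which is what the question asks for.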

Related

Sorting data frame columns based on a specific value in each column

I am using the tidyverse package in R. I have a data frame with 20 rows and 500 columns. I want to sort all the columns by the size of the value in the last row of each column.
Here is an example with just 3 rows and 4 columns:
1 2 3 4
5 6 7 8
8 7 9 1
The desired result is:
3 1 2 4
7 5 6 8
9 8 7 1
I searched stack overflow but could not find an answer to this type of question.
If we want to use dplyr from the tidyverse, we can use slice to get the last row and then use order with decreasing = TRUE to subset the columns.
library(dplyr)
df[df %>% slice(n()) %>% order(decreasing = TRUE)]
# V3 V1 V2 V4
#1 3 1 2 4
#2 7 5 6 8
#3 9 8 7 1
Its translation in base R would be
df[order(df[nrow(df), ], decreasing = TRUE)]
data
df <- read.table(text = "1 2 3 4
5 6 7 8
8 7 9 1")
The following reorders the data frame columns by the values in the last row:
df <- data.frame(col1=c(1,5,8),col2=c(2,6,7),col3=c(3,7,9),col4=c(4,8,1))
last_row <- df[nrow(df),]
df <- df[,order(last_row,decreasing = T)]
First we extract the last row, then sort it with the order() function and return the columns in that order.
> df
col3 col1 col2 col4
1 3 1 2 4
2 7 5 6 8
3 9 8 7 1
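A small variant of the base-R approach: flattening the last row to a plain vector with unlist() first, which sidesteps any question of how order() dispatches on a one-row data frame (the data and column names here match the example above):

```r
# Example data: 3 rows, 4 columns, as in the answer above
df <- data.frame(col1 = c(1, 5, 8), col2 = c(2, 6, 7),
                 col3 = c(3, 7, 9), col4 = c(4, 8, 1))

# Flatten the last row to a named numeric vector, then order columns by it
last_row  <- unlist(df[nrow(df), ])
df_sorted <- df[, order(last_row, decreasing = TRUE)]
```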

How to remove outliers from multiple columns of a data frame

I would like to get a data frame that contains only data that is within 2 SD per each numeric column.
I know how to do it for a single column but how can I do it for a bunch of columns at once?
Here is the toy data frame:
df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c",header = TRUE)
Here is the code line for getting only the data that is within 2 SD for a single column (birds). How can I do it for all numeric columns at once?
df[!(abs(df$birds - mean(df$birds))/sd(df$birds)) > 2,]
target birds wolfs Country
2 3 8 4 b
3 1 2 8 c
4 1 2 3 a
5 1 8 3 a
6 6 1 2 a
7 6 7 1 b
8 6 1 5 c
We can use lapply to loop over the dataset columns and subset the numeric vectors (using an if/else condition) based on the mean and sd.
lapply(df, function(x) if(is.numeric(x)) x[!(abs((x-mean(x))/sd(x))>2)] else x)
EDIT:
I was under the impression that we needed to remove the outliers for each column separately. But if we need to keep only the rows that have no outliers in any numeric column, we can loop through the columns with lapply as before; instead of returning 'x', we return the sequence of 'x', and then get the intersection of the list elements with Reduce. The numeric index can be used for subsetting the rows.
lst <- lapply(df, function(x) if(is.numeric(x))
seq_along(x)[!(abs((x-mean(x))/sd(x))>2)] else seq_along(x))
df[Reduce(intersect,lst),]
I'm guessing that you are trying to filter your data set by checking that all of the numeric columns are within 2 SD(?)
In that case I would suggest creating two filters: one that indicates the numeric columns, and a second that checks that all of them are within 2 SD. For the second condition, we can use the built-in scale function.
indx <- sapply(df, is.numeric)
indx2 <- rowSums(abs(scale(df[indx])) <= 2) == sum(indx)
df[indx2,]
# target birds wolfs Country
# 2 3 8 4 b
# 3 1 2 8 c
# 4 1 2 3 a
# 5 1 8 3 a
# 6 6 1 2 a
# 7 6 7 1 b
# 8 6 1 5 c
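The same row filter can be sketched in dplyr as well (an assumption: dplyr >= 1.0.4 installed for if_all(); the z() helper below is our own, not a dplyr function):

```r
library(dplyr)

df <- read.table(text = "target birds wolfs Country
3 21 7 a
3 8 4 b
1 2 8 c
1 2 3 a
1 8 3 a
6 1 2 a
6 7 1 b
6 1 5 c", header = TRUE)

# Our own z-score helper; filter() evaluates it on each full column
z <- function(x) (x - mean(x)) / sd(x)

# keep rows where every numeric column is within 2 SD of its mean
res <- df %>% filter(if_all(where(is.numeric), ~ abs(z(.x)) <= 2))
```

This should agree with the scale-based base-R answer above: only the first row (birds = 21) is dropped.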

Looping through all combinations of row products in a matrix

I have a matrix called Dataset, which contains high-throughput data like the one below:
V1 V2 V3
A 2 3 3
B 4 2 7
C 3 1 4
What I want to do is find all combinations of row products for this matrix in the form of a list, as shown below (I'll call it Comb):
V1 V2 V3
A A 4 9 9
A B 8 6 21
A C 6 3 12
B B 16 4 49
B C 12 2 28
C C 9 1 16
What I have so far is the following:
combs <- combn(seq_len(nrow(Dataset)), 2)
Comb <- Dataset[combs[1,], ] * Dataset[combs[2,], ]
rownames(Comb) <- apply(combn(rownames(Dataset), 2), 2, paste, collapse = " ")
Unfortunately, the main problem with this script is that I don't get the products of rows multiplied by themselves. Using the above script, I get the following matrix:
V1 V2 V3
A B 8 6 21
A C 6 3 12
B C 12 2 28
So I was wondering if it would be possible to modify the code so that it also multiplies each row by itself? Or is there another way to do this that might be more efficient? When I tried the script on a high-throughput dataset (which is fairly large), it took several seconds to produce the output for a table with 1000 rows, so if anyone knows of a faster way to do this, I'd love to hear your thoughts.
Thanks for your help!
Here is a minimal adaptation of your implementation. We just add the self-pairs you want to your combn result and then reuse the same logic:
r.seq <- seq_len(nrow(Dataset))
combs <- matrix(c(combn(r.seq, 2), rep(r.seq, each=2)), nrow=2) # notice how we add values here
Comb <- Dataset[combs[1,], ] * Dataset[combs[2,], ]
rownames(Comb) <- apply(matrix(rownames(Dataset)[c(combs)], nrow=2), 2, paste, collapse=" ")
Produces:
V1 V2 V3
A B 8 6 21
A C 6 3 12
B C 12 2 28
A A 4 9 9
B B 16 4 49
C C 9 1 16
You can always sort by row name too. One advantage over expand.grid is that this only calculates the combinations you want.
You may apply expand.grid on the sequence of rows to get 'd1', subset the rows of 'df1' using each column of 'd1', and multiply. Then create an index ('indx') of row names with outer to remove the rows that are not needed.
d1 <- expand.grid(1:nrow(df1), 1:nrow(df1))
res <- df1[d1[,1],] * df1[d1[,2],]
indx <- outer(rownames(df1), rownames(df1), FUN=paste)
v1 <- indx[upper.tri(indx, diag=TRUE)]
res1 <- res[do.call(paste,expand.grid(rownames(df1), rownames(df1))) %in% v1,]
row.names(res1) <- v1
res1
# V1 V2 V3
#A A 4 9 9
#A B 8 6 21
#B B 16 4 49
#A C 6 3 12
#B C 12 2 28
#C C 9 1 16
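Putting the first answer together as a self-contained sketch, with the row-name sort mentioned above so the self-pairs land in order (the example matrix matches the question):

```r
# Example data from the question: rows A, B, C
Dataset <- matrix(c(2, 4, 3,  3, 2, 1,  3, 7, 4), nrow = 3,
                  dimnames = list(c("A", "B", "C"), c("V1", "V2", "V3")))

r.seq <- seq_len(nrow(Dataset))
# all unordered pairs, plus each row paired with itself
combs <- matrix(c(combn(r.seq, 2), rep(r.seq, each = 2)), nrow = 2)
Comb  <- Dataset[combs[1, ], ] * Dataset[combs[2, ], ]
rownames(Comb) <- apply(matrix(rownames(Dataset)[c(combs)], nrow = 2),
                        2, paste, collapse = " ")
# sort so "A A", "A B", ... appear in order
Comb <- Comb[order(rownames(Comb)), ]
```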

Keep columns of a data frame based on a data frame

I have a data frame, called df, which contains 4000 values. I have a list of 1000 column numbers, in a data frame called list, which is 1000 rows by 1 column. How can I keep the rows with the numbers in list in the data frame df and throw the rest out? I already tried using:
listv <- as.vector(list)
and then using
dfnew <- df[,listv]
but I get the error
Error in .subset(x, j) : invalid subscript type 'list'
You're mixing up rows and columns subsetting. Here is a minimal example:
df <- data.frame(matrix(1:21, ncol = 3))
df
# X1 X2 X3
# 1 1 8 15
# 2 2 9 16
# 3 3 10 17
# 4 4 11 18
# 5 5 12 19
# 6 6 13 20
# 7 7 14 21
list <- data.frame(V1 = c(1, 4, 6))
list
# V1
# 1 1
# 2 4
# 3 6
df[list[, 1], ]
# X1 X2 X3
# 1 1 8 15
# 4 4 11 18
# 6 6 13 20
df[unlist(list), ]
# X1 X2 X3
# 1 1 8 15
# 4 4 11 18
# 6 6 13 20
Note also that as.vector(list) doesn't create a vector, as you thought it would. You need unlist here (as I used in the last example).
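Since the question's title asks about keeping columns, the same unlist() index works in the column position of [ as well (keep is a made-up name for the index data frame, mirroring list above):

```r
df <- data.frame(matrix(1:21, ncol = 3))

# hypothetical data frame of column numbers to keep
keep <- data.frame(V1 = c(1, 3))

# unlist() turns it into a plain numeric vector, valid as a column index
df_new <- df[, unlist(keep)]
```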

Making a data frame that is a subset of two data frames

I am stumped again.
I have two data frames
dataframe1
a b c
[1] 21 12 22
[2] 11 9 6
[3] 4 6 7
and
dataframe2
f g h
[1] 21 12 22
[2] 11 9 6
[3] 4 6 7
I want to take the first column of dataframe1 and make three new data frames, with the second column being each of the three columns f, g and h.
Obviously I could just cbind over and over:
subset1 <- cbind(dataframe1[, 1], dataframe2[, 1])
subset2 <- cbind(dataframe1[, 1], dataframe2[, 2])
but my data frames will have variable numbers of columns and many rows, so I am looking for something more general. My data frames will always be the same length.
The closest I have come was with apply and cbind, but I got either three rows (a paired with f, a with g, a with h, each combined into a single numeric vector) or a single data frame with four columns: a, f, g and h.
Help is deeply appreciated.
You can use lapply to iterate over the columns of dataframe2, like so:
lapply(dataframe2, function(x) as.data.frame(cbind(dataframe1[,1], x)))
This will result in a list object where each entry corresponds to a column of dataframe2. For example:
$f
V1 x
1 21 21
2 11 11
3 4 4
$g
V1 x
1 21 12
2 11 9
3 4 6
$h
V1 x
1 21 22
2 11 6
3 4 7
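A small variation on the answer (the naming choices here are our own, not from the answer) that iterates over column names instead of columns, so each list entry keeps the original column names rather than V1/x:

```r
# Example data matching the question
dataframe1 <- data.frame(a = c(21, 11, 4), b = c(12, 9, 6), c = c(22, 6, 7))
dataframe2 <- data.frame(f = c(21, 11, 4), g = c(12, 9, 6), h = c(22, 6, 7))

# pair column 1 of dataframe1 with each column of dataframe2 by name
result <- lapply(names(dataframe2), function(nm) {
  data.frame(a = dataframe1[, 1], dataframe2[nm])
})
names(result) <- names(dataframe2)
# result$g is a data frame with columns a and g
```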