R dataframe filtering

I have a dataframe df as follows:
A B C
NA 1 2
2 NA 3
4 5 6
7 8 9
What I want to do is remove all the rows that have an NA.
If I use
apply(df, 1, function(row) all(!is.na(row)))
I get a logical vector over the rows: TRUE if the row does not contain an NA, FALSE if it does.
But how do I get the row names, so that I can do something like
df2 <- df[-c(list of rows that contain NA), ]
which will give me a new dataframe without the rows that contain NA?
Thanks in advance.

Assuming you have a dataframe that looks like this:
A B C
1 NA 1 2
2 2 NA 3
3 4 5 6
4 7 8 9
Then try:
df1[apply(df1, 1, function(x) !any(is.na(x))), ]
A B C
3 4 5 6
4 7 8 9
It doesn't use row names but rather a logical vector. I guess Joshua and I read your question differently, but we used the same method.
Joshua's suggestion is more compact:
> na.omit(df1)
A B C
3 4 5 6
4 7 8 9
And it reminds me that I should have used:
> df1[complete.cases(df1), ]
A B C
3 4 5 6
4 7 8 9
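For anyone pasting these in, the frame the answers call df1 can be reconstructed from the table above (this construction is mine, not part of the original answers):
df1 <- data.frame(A = c(NA, 2, 4, 7),
                  B = c(1, NA, 5, 8),
                  C = c(2, 3, 6, 9))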

You can use the logical vector from your apply call to index your data.frame.
> Data[!apply(Data,1,function(row) all(!is.na(row))),]
A B C
1 NA 1 2
2 2 NA 3
> # or like this:
> Data[apply(Data,1,function(row) any(is.na(row))),]
A B C
1 NA 1 2
2 2 NA 3

is.na on a data.frame returns a matrix, which is a better candidate for apply:
df <- read.table(textConnection(" A B C
NA 1 2
2 NA 3
4 5 6
7 8 9
"))
## is.na(df) is a logical matrix
is.na(df)
## logical vector: TRUE for rows that contain any NA
apply(is.na(df), 1, any)
## one-liner: keep only the complete rows
df[!apply(is.na(df), 1, any), ]
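An equivalent vectorised form (my addition, not from the original answers) sums the logical matrix by row instead of looping with apply:
## rowSums on a logical matrix counts the TRUEs, i.e. the NAs per row
df[rowSums(is.na(df)) == 0, ]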

Related

Find the index of columns containing more than 5 NA values

I want to subset a dataframe and extract only the columns that contain 5 or more NA values.
data.frame(A = rep(1, 10),
           B = c(rep(2, 5), rep(3, 5)),
           D = rep(5, 10),
           E = c(rep(1, 2), rep(NA, 6), rep(6, 2)),
           F = c(rep(NA, 2), rep(2, 8)))
A B D E F
1 1 2 5 1 NA
2 1 2 5 1 NA
3 1 2 5 NA 2
4 1 2 5 NA 2
5 1 2 5 NA 2
6 1 3 5 NA 2
7 1 3 5 NA 2
8 1 3 5 NA 2
9 1 3 5 6 2
10 1 3 5 6 2
So in this example I want to have the index of the column "E".
My original dataset has about 3000 columns, so speed is more or less important.
I have been trying to do this with sum(is.na) and filter_if(any_vars), but to no avail.
Using colSums with is.na:
names(df)[colSums(is.na(df)) > 5]
[1] "E"
We can use colSums on the logical matrix is.na(df1), get the index with which, and extract the names:
names(which(colSums(is.na(df1)) >= 5))
#[1] "E"
which(unlist(lapply(df, function(x) sum(is.na(x)) > 5)))
E
4
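For reference, the per-column NA counts that drive all three answers (df being the sample frame above):
colSums(is.na(df))
# A B D E F
# 0 0 0 6 2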

Using combn in R to create a matrix of all possible combinations

This is related to a homework question I am working on. I need to perform a data manipulation of a few vectors into a matrix, and the TA suggested using the combn function:
# what I'm starting with
a = c(1, 2)
b = c(NA, 4, 5)
c = c(7, 8)
# what I need to get
my_matrix
a b c
1 NA 7
1 NA 8
1 4 7
1 4 8
1 5 7
1 5 8
2 NA 7
2 NA 8
2 4 7
2 4 8
2 5 7
2 5 8
my_matrix is a matrix with all possible combinations of the elements in a, b and c, with column names a, b, and c. I understand what combn() is doing, but I'm not exactly sure how to convert its output into the matrix shown above.
Thanks in advance for any help!
expand.grid, mentioned in the comments to the question, is the better and much easier way to do it. But you can use combn too
#STEP 1: Get all combinations of elements of 'a', 'b', and 'c' taken 3 at a time
temp = t(combn(c(a, b, c), 3))
# STEP 2: In the first column, only keep values present in 'a'
#Repeat STEP 2 for second column with 'b', third column with 'c'
#Use setNames to rename the column names as you want
ans = setNames(data.frame(temp[temp[, 1] %in% a &
                                temp[, 2] %in% b &
                                temp[, 3] %in% c, ]),
               nm = c('a', 'b', 'c'))
ans
# a b c
#1 1 NA 7
#2 1 NA 8
#3 1 4 7
#4 1 4 8
#5 1 5 7
#6 1 5 8
#7 2 NA 7
#8 2 NA 8
#9 2 4 7
#10 2 4 8
#11 2 5 7
#12 2 5 8
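For completeness, a sketch of the expand.grid route mentioned above; the row order differs from the combn answer because expand.grid varies its first argument fastest:
my_matrix <- as.matrix(expand.grid(a = a, b = b, c = c))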

Populating a data frame with corresponding values from another

I have a data frame containing values read in from an experiment with independent variables A and B, which doesn't cover all possible permutations of A and B. I need to create a data frame which does contain all permutations, with NA in those places where that pair of values isn't present in the data.
To create some sample data,
interactions <- unique(data.frame(A = sample(1:5, 10, replace = TRUE),
                                  B = sample(1:5, 10, replace = TRUE)))
interactions <- interactions[interactions$A < interactions$B, ]
interactions$val <- runif(nrow(interactions))
possible.interactions <- data.frame(t(combn(1:5, 2)))
names(possible.interactions) <- c('A', 'B')
which creates
interactions
A B val
1 5 0.6881106
1 2 0.5286560
2 4 0.5026426
and
possible.interactions
A B
1 2
1 3
1 4
1 5
2 3
2 4
2 5
3 4
3 5
4 5
and I want to output
A B val
1 2 0.5286560
1 3 NA
1 4 NA
1 5 0.6881106
2 3 NA
2 4 0.5026426
2 5 NA
3 4 NA
3 5 NA
4 5 NA
What is the fastest way to do this?
Here is a base solution that is much faster (~10x) than merge:
possible.interactions$val <- interactions$val[
  match(
    do.call(paste, possible.interactions),
    do.call(paste, interactions[1:2])
  )
]
This produces the following (note: the values differ from your expected output because you didn't set a seed):
# A B val
# 1 1 2 0.59809242
# 2 1 3 0.92861520
# 3 1 4 0.64279549
# 4 1 5 NA
# 5 2 3 0.03554058
# 6 2 4 NA
# 7 2 5 NA
# 8 3 4 NA
# 9 3 5 NA
# 10 4 5 NA
This assumes A and B do not contain spaces, and that interactions has no duplicate A-B pairs (match always returns the first hit).
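If the columns could contain spaces, one guard (my addition, not from the original answer) is to paste with a separator that cannot occur in the data:
key.possible <- do.call(paste, c(possible.interactions, sep = "\r"))
key.actual <- do.call(paste, c(interactions[1:2], sep = "\r"))
possible.interactions$val <- interactions$val[match(key.possible, key.actual)]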
And the data.table version:
possible.DT <- data.table(possible.interactions)
DT <- data.table(interactions, key=c("A", "B"))
DT[possible.DT]
Though this is only worthwhile if your tables are large or you have uses for other benefits of data.table. I've found speed to be comparable to match in simple cases if you include the overhead of creating and keying the tables. I'm sure there are cases where data.table is much faster, especially if you key once and then use that key a lot.
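With current data.table you can also join ad hoc via on=, without keying first (my addition to the original answer):
data.table(interactions)[data.table(possible.interactions), on = c("A", "B")]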
For completeness, here is the merge version:
merge(possible.interactions, interactions, all.x = TRUE)
If order is important to you, I recommend using join from the plyr package, as opposed to merge, which does not provide an intuitive ordering when there are unmatched elements.
library(plyr)
join(interactions, possible.interactions, type = "right")
Joining by: A, B
A B val
1 1 2 NA
2 1 3 NA
3 1 4 0.007602083
4 1 5 0.853415110
5 2 3 NA
6 2 4 0.321098658
7 2 5 NA
8 3 4 NA
9 3 5 NA
10 4 5 NA
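A modern alternative to the plyr call (my addition; plyr has since been superseded by dplyr) is a right join, which keeps every row of possible.interactions:
library(dplyr)
right_join(interactions, possible.interactions, by = c("A", "B"))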

How to set the whole column to the index of the row

How do I turn a column of a data frame into its row names (index)?
a b c d
1 1 3 3 1
2 2 3 4 5
3 4 5 6 7
4 6 5 7 8
becomes
b c d
1 3 3 1
2 3 4 5
4 5 6 7
6 5 7 8
I have tried to use xts(), but then the column types changed to character rather than numeric.
As explained in the comments, here is a solution:
row.names(df) <- df$a
df <- df[-1] # drop column 'a', since it now lives in the row names
If you then want the remaining columns as numeric, convert them column by column (note that apply(df, 2, as.numeric) would return a matrix and drop the new row names, so lapply is the safer route):
df[] <- lapply(df, as.numeric)
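A minimal reproducible run on the sample data (reconstructing the frame with read.table is my addition):
df <- read.table(text = "a b c d
1 3 3 1
2 3 4 5
4 5 6 7
6 5 7 8", header = TRUE)
row.names(df) <- df$a # column 'a' becomes the index
df <- df[-1]
df
#   b c d
# 1 3 3 1
# 2 3 4 5
# 4 5 6 7
# 6 5 7 8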

Retrieving subset of a data frame by finding entries with NA in specific columns

Suppose we have a data frame with NA values like so:
> data
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2
I wish to know a general method for retrieving the subset of data with NA values in C or A. So the output should be,
A B C D
1 3 NA 4
NA 3 3 5
4 2 NA NA
I tried using the subset command like so, subset(data, A==NA | C==NA), but it didn't work. Any ideas?
A very handy function for this sort of thing is complete.cases: it checks each row for NA and returns FALSE if the row contains any, TRUE otherwise. (Your subset attempt fails because comparisons with NA yield NA rather than TRUE or FALSE, so A == NA can never select a row; is.na() is the right test.)
So, subset just those two columns of your data, negate complete.cases(.) on them, and use the result to index the rows of the original data, as follows:
# assuming your data is in 'df'
df[!complete.cases(df[, c("A", "C")]), ]
# A B C D
# 1 1 3 NA 4
# 3 NA 3 3 5
# 4 4 2 NA NA
Here is one possibility:
# Read your data
data <- read.table(text="
A B C D
1 3 NA 4
2 1 3 4
NA 3 3 5
4 2 NA NA
2 NA 4 3
1 1 1 2",header=T,sep="")
# Now subset your data
subset(data, is.na(C) | is.na(A))
A B C D
1 1 3 NA 4
3 NA 3 3 5
4 4 2 NA NA
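The equivalent bracket indexing (my addition) avoids subset's non-standard evaluation, which matters if this runs inside a function:
data[is.na(data$A) | is.na(data$C), ]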
