Find a row in a data frame when some columns contain NA's

I have a translation table with 67 columns and I get an input of 67 columns.
My goal is to check if I can find it within this translation table.
To be clear, the 67 columns form the key, and an additional 10 columns hold the actual values for that key.
How can I quickly find the matching row when some of the columns (variables) in the input may contain NA values?
A small example:

Input:
a b c   d  e
1 9 "r" NA NA

Translation table:
a b  c   d  e
5 NA NA  NA 9
6 9  "o" 4  3
1 9  "r" NA NA

We can paste each row of both datasets into a single string and then use %in% to get a logical vector indicating whether each string is contained in the other vector. Wrapping with which gives the positions of the rows where this is TRUE:
which(do.call(paste, df2) %in% do.call(paste, df1))
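As a quick illustration with the small example above (the data frame names df1 for the one-row input and df2 for the translation table are assumed here for the sketch):
df1 <- data.frame(a = 1, b = 9, c = "r", d = NA, e = NA)
df2 <- data.frame(a = c(5, 6, 1), b = c(NA, 9, 9), c = c(NA, "o", "r"),
                  d = c(NA, 4, NA), e = c(9, 3, NA))
# paste() turns each row into one string; NA becomes the literal "NA",
# so missing values compare equal as well
which(do.call(paste, df2) %in% do.call(paste, df1))
#[1] 3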

Related

inserting specific elements into vector using R

Let's say I have two vectors, one that includes NA values, and another that is the length of the first vector after dropping the NA values. I am looking to insert the NA values from the first vector into the second vector, while keeping the position of the NA values the same.
a<-c(1,2,3,6,5,NA,4,5,NA,45,6,NA)
b<-c(1,2,4,3,6,5,7,8,40)
This can be done by concatenating each component, but this seems extremely tedious, especially since my data are much more complicated than the above example. Something like
b[which(is.na(a))]<-NA
is what I am looking for, but this of course replaces elements instead of inserting elements like I want. I am at a loss for this even though it seems relatively simple.
Create an NA vector of the same length as 'a' and then replace the positions corresponding to the non-NA elements of 'a' with the values of 'b':
b <- replace(rep(NA, length(a)), !is.na(a), b)
Output:
b
#[1] 1 2 4 3 6 NA 5 7 NA 8 40 NA
Or, more compactly, do the replacement on 'a' itself:
replace(a, !is.na(a), b)
[1] 1 2 4 3 6 NA 5 7 NA 8 40 NA
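For completeness, the same replacement can also be written as an ordinary subset-assignment in base R (an equivalent sketch, not part of the original answer; it overwrites 'a' in place):
a <- c(1, 2, 3, 6, 5, NA, 4, 5, NA, 45, 6, NA)
b <- c(1, 2, 4, 3, 6, 5, 7, 8, 40)
# write the values of 'b' into the non-NA positions of 'a'
a[!is.na(a)] <- b
a
#[1]  1  2  4  3  6 NA  5  7 NA  8 40 NA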

Convert entire data frame into one long column (vector)

I want to turn the entire content of a numeric (incl. NA's) data frame into one column. What would be the smartest way of achieving the following?
> df <- data.frame(C1 = c(1, NA, 3), C2 = c(4, 5, NA), C3 = c(NA, 8, 9))
> df
  C1 C2 C3
1  1  4 NA
2 NA  5  8
3  3 NA  9
> x <- mysterious_operation(df)
> x
[1]  1 NA  3  4  5 NA NA  8  9
I want to calculate the mean of this vector, so ideally I'd like to remove the NA's within the mysterious_operation; the data frame I'm working with is very large, so that would probably be a good idea.
Here are a couple of ways with purrr:
# using invoke, a wrapper around do.call
purrr::invoke(c, df, use.names = FALSE)
# similar to unlist, reduce list of lists to a single vector
purrr::flatten_dbl(df)
Both return:
[1] 1 NA 3 4 5 NA NA 8 9
The mysterious operation you are looking for is called unlist:
> df <- data.frame(C1=c(1,NA,3),C2=c(4,5,NA),C3=c(NA,8,9))
> unlist(df, use.names = F)
[1] 1 NA 3 4 5 NA NA 8 9
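Since the stated goal is the mean with the NA's removed, na.rm = TRUE handles that directly on the unlisted vector (a small sketch using the df defined above):
v <- unlist(df, use.names = FALSE)
mean(v, na.rm = TRUE)
#[1] 5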
We can use unlist and create a single-column data.frame:
df1 <- data.frame(col = unlist(df))
Just for fun (of course unlist is the most appropriate function), some alternatives:
stack(df)[, 1]
do.call(c, df)
do.call(c, c(df, use.names = FALSE))  # unnamed version
Maybe they are more mysterious.

How can I find out the names of columns that satisfy a condition in a data frame

I wish to know (by name) which columns in my data frame satisfy a particular condition. For example, if I were looking for the names of any columns that contain more than 3 NA values, how could I proceed?
>frame
m n o p
1 0 NA NA NA
2 0 2 2 2
3 0 NA NA NA
4 0 NA NA 1
5 0 NA NA NA
6 0 1 2 3
> for (i in frame) {
    na <- is.na(i)
    as.numeric(na)
    total <- sum(na)
    if (total > 3) {
      print(i)
    }
  }
[1] NA  2 NA NA NA  1
[1] NA  2 NA NA NA  2
So this actually succeeds in evaluating which columns satisfy the condition, however, it does not display the column name. Perhaps subsetting the columns which interest me would be another way to do it, but I'm not sure how to solve it that way either. Plus I'd prefer to know if there's a way to just get the names directly.
I'll appreciate any input.
We can use colSums on the logical matrix returned by is.na(frame), check whether each column sum is greater than 3 to get a logical vector, and then subset the names of 'frame' based on that:
names(frame)[colSums(is.na(frame))>3]
#[1] "n" "o"
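To see the intermediate steps on the example frame above:
colSums(is.na(frame))
#m n o p
#0 4 4 3
colSums(is.na(frame)) > 3
#    m     n     o     p
#FALSE  TRUE  TRUE FALSE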
If we are using dplyr, one way is:
library(dplyr)
frame %>%
  summarise_each(funs(sum(is.na(.)) > 3)) %>%
  unlist() %>%
  names(.)[.]
#[1] "n" "o"
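Note that summarise_each() and funs() have since been deprecated; in more recent dplyr versions (>= 1.0.0) the same idea can be written with across() (a sketch, not part of the original answer):
library(dplyr)
frame %>%
  summarise(across(everything(), ~ sum(is.na(.x)) > 3)) %>%
  unlist() %>%
  names(.)[.]
#[1] "n" "o"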

Add columns in vector but not in df

I am trying to do the following and was wondering if there is an easier way to use dplyr to achieve this (I'm sure there is):
I want to compare the columns of a dataframe to a vector of names, and if the df does not contain a column corresponding to one of the names in the name vector, add that column to the df and populate its values with NAs.
E.g., in the MWE below:
df <- data.frame(cbind(c(1:6),c(11:16),c(10:15)))
colnames(df) <- c("A","B","C")
names <- c("A","B","C","D","E")
how do I use dplyr to create the two columns D and E (which are in names, but not in df) and populate them with NAs?
No need for dplyr; this is just a basic operation in base R. (Btw, try to avoid overriding built-in functions such as names in the future. The reason names(df) still works is that, when names is called as a function, R skips the character vector in the global environment and finds the function in the base namespace, but masking base functions is still bad practice.)
df[setdiff(names, names(df))] <- NA
df
# A B C D E
# 1 1 11 10 NA NA
# 2 2 12 11 NA NA
# 3 3 13 12 NA NA
# 4 4 14 13 NA NA
# 5 5 15 14 NA NA
# 6 6 16 15 NA NA
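The key step is setdiff, which picks out the names that are not yet columns of df; evaluated before the assignment above, it gives:
setdiff(names, names(df))
#[1] "D" "E"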

How to delete rows from a dataframe that contain n*NA

I have a number of large datasets with ~10 columns and ~200000 rows. Not all columns contain values for each row, although at least one column must contain a value for the row to be present. I would like to set a threshold for how many NAs are allowed in a row.
My Dataframe looks something like this:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
C NA 9 4 NA 4 8 4 NA 5 NA
D 2 2 6 8 4 NA 3 7 1 32
And I would like to be able to delete the rows that contain more than 2 cells with NA, to get:
ID q r s t u v w x y z
A 1 5 NA 3 8 9 NA 8 6 4
B 5 NA 4 6 1 9 7 4 9 3
D 2 2 6 8 4 NA 3 7 1 32
complete.cases removes all rows containing any NA, and I know one can delete rows that contain NA in certain columns, but is there a way to be non-specific about which columns contain the NAs and instead filter on how many of them a row contains in total?
Alternatively, this dataframe is generated by merging several dataframes using
file1<-read.delim("~/file1.txt")
file2<-read.delim(file=args[1])
file1<-merge(file1,file2,by="chr.pos",all=TRUE)
Perhaps the merge function could be altered?
Thanks
Use rowSums. To remove rows from a data frame (df) that contain precisely n NA values:
df <- df[rowSums(is.na(df)) != n, ]
or to remove rows that contain n or more NA values:
df <- df[rowSums(is.na(df)) < n, ]
In both cases, of course, replace n with the number that's required.
If dat is the name of your data.frame, the following will return what you're looking for (rows with at most two NAs are kept):
keep <- rowSums(is.na(dat)) <= 2
dat <- dat[keep, ]
What this is doing:
is.na(dat)
# returns a matrix of TRUE/FALSE;
# note that when adding logicals,
# TRUE == 1 and FALSE == 0
rowSums(.)
# quickly computes the total per row,
# since the task is to identify the
# rows with a certain number of NA's
rowSums(.) <= 2
# for each row, determine whether the sum
# (which is the number of NAs) is at most 2.
# Returns TRUE/FALSE accordingly
We use the output of this last statement to identify which rows to keep. Note that it is not necessary to actually store this last logical vector.
If d is your data frame, try this:
d <- d[rowSums(is.na(d)) <= 2, ]
This will return a dataset where at most two values per row are missing:
dfrm[ apply(dfrm, 1, function(r) sum(is.na(r)) <= 2 ), ]
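A quick check against the example data from the question (re-typed here for illustration; the ID column contains no NAs, so it does not affect the row counts):
dat <- data.frame(
  ID = c("A", "B", "C", "D"),
  q = c(1, 5, NA, 2),  r = c(5, NA, 9, 2), s = c(NA, 4, 4, 6),
  t = c(3, 6, NA, 8),  u = c(8, 1, 4, 4),  v = c(9, 9, 8, NA),
  w = c(NA, 7, 4, 3),  x = c(8, 4, NA, 7), y = c(6, 9, 5, 1),
  z = c(4, 3, NA, 32)
)
rowSums(is.na(dat))
#[1] 2 1 4 1
dat[rowSums(is.na(dat)) <= 2, ]
#  ID q  r  s t u  v  w x y  z
#1  A 1  5 NA 3 8  9 NA 8 6  4
#2  B 5 NA  4 6 1  9  7 4 9  3
#4  D 2  2  6 8 4 NA  3 7 1 32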
