How to exclude missing data in specific columns in R

I have a df with 15,105 rows and 127 columns. I'd like to exclude the rows that have NA in some specific columns. I'm using the following command:
wave1b <- na.omit(wave1, cols=c("Bx", "Deq", "Gef", "Has", "Pla", "Ty"))
However, when I run it, it returns only 19 rows, when I expected 14,561 rows (which is what it should return if it excluded only the rows with NA in the requested columns). I can affirm this because I took a subset of the df to check the accuracy of the missing-value deletion.
Could anyone help me solve this issue? Thank you!

I think this code is not efficient but it could work:
df <- data.frame(A = rep(NA,3), B = c(NA,2,3),C=c(1,NA,2))
df
A B C
1 NA NA 1
2 NA 2 NA
3 NA 3 2
It removes only the rows which have a missing value in column B or column C (logical indexing is used here because -which(...) would drop every row if no row matched):
df[!(is.na(df$B) | is.na(df$C)), ]
A B C
3 NA 3 2

You can use complete.cases
> df[complete.cases(df[, -1]), ]
A B C
3 NA 3 2
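As a minimal sketch of how this could apply to the question: the data.frame method of na.omit has no cols argument, so the extra argument is silently ignored and every row with an NA in any of the 127 columns is dropped, which would explain the 19 rows. Restricting complete.cases to the columns of interest avoids that. The small wave1 below is made-up stand-in data, and only two of the question's column names are used.

```r
# Hypothetical stand-in for the asker's wave1 (the real data has 127 columns).
wave1 <- data.frame(
  Bx    = c(1, NA, 3, 4),
  Deq   = c(1, 2, NA, 4),
  Other = c(NA, 2, 3, NA)  # NAs here must NOT cause a row to be dropped
)

# Columns whose NAs should trigger row removal.
cols <- c("Bx", "Deq")

# Keep rows that are complete in the selected columns only.
wave1b <- wave1[complete.cases(wave1[, cols, drop = FALSE]), ]
wave1b
```

Rows 2 and 3 are dropped because Bx or Deq is missing; rows 1 and 4 stay even though Other is NA there.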

Related

Getting wrong result while removing all NA value columns in R

I am getting the wrong result when removing the columns that contain only NA values in R.
data file : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
trainingData <- read.csv("D:\\pml-training.csv",na.strings = c("NA","", "#DIV/0!"))
Now I want to remove all the columns that contain only NAs.
Approach 1: here I mean to keep every column whose count of non-NA values is greater than 0:
aa <- trainingData[colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
154 columns
Approach 2: as I read this query, it should return all the columns that are NA (sum = 0), but it actually returns the columns that do not contain any NA, which is the result I expected:
bb <- trainingData[,colSums(is.na(trainingData)) == 0]
length(colnames(bb))
60 columns (expected)
Can someone please help me understand what is wrong with the first statement and what is right with the second one?
aa <- trainingData[,colSums(!is.na(trainingData)) > 0]
length(colnames(aa))
You convert the dataframe to a boolean matrix with !is.na(trainingData), and find all columns where there is at least one TRUE (that is, at least one non-NA value). So this returns all columns that have at least one non-NA value, which seems to be all but 6 columns.
bb <- trainingData[colSums(is.na(trainingData)) == 0]
length(colnames(bb))
You convert the dataframe to a boolean matrix with is.na(trainingData) and return all columns where there is no TRUE (no NA) in the column. This returns all columns where there are no missing values (i.e. no NAs).
Example as requested in comment:
df = data.frame(a=c(1,2,3),b=c(NA,1,1),c=c(NA,NA,NA))
bb <- df[colSums(is.na(df)) == 0]
> df
a b c
1 1 NA NA
2 2 1 NA
3 3 1 NA
> bb
a
1 1
2 2
3 3
So the statements are in fact different. If you want to remove all columns that are only NA's, you should use the first statement. Hope this helps.
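To make the difference concrete, here is a small sketch that runs both statements on the same example data frame. The names keep_any and keep_complete are illustrative only.

```r
df <- data.frame(a = c(1, 2, 3), b = c(NA, 1, 1), c = c(NA, NA, NA))

# Statement 1: keep columns with at least one non-NA value
# (drops only the all-NA column c).
keep_any <- df[colSums(!is.na(df)) > 0]

# Statement 2: keep columns with no NA at all
# (drops b as well, because it has one NA).
keep_complete <- df[colSums(is.na(df)) == 0]

names(keep_any)       # "a" "b"
names(keep_complete)  # "a"
```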

Merge two columns with Factors and NAs

I have two columns of factors that I want to merge. As I have a lot of observations, I wonder if there's a quick option with dplyr or tidyr.
Col1 Col2
A NA
B NA
NA C
A A
NA B
A NA
B B
I know that this shouldn't be difficult but I'm clearly missing something here. I've tried several options but as I want to keep the factors, all the ones I know didn't work.
Note that when both columns have a result, they will always be the same. But this is part of the data characteristics I have.
I expect to have something such as:
Col1 Col2 Col3
A NA A
B NA B
NA C C
A A A
NA B B
A NA A
B B B
I think this should do it using dplyr:
library('dplyr')
dat %>%
  mutate(Col3 = if_else(is.na(Col1), Col2, Col1))
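Depending on the dplyr version, if_else may complain when the two factors do not share the same levels. A possible base R fallback, sketched here under the assumption that the data look like the example in the question, is to combine the columns as character and convert back to a factor. The data frame dat below is reconstructed from the question's table.

```r
# Example data rebuilt from the question; the two factors have different levels.
dat <- data.frame(
  Col1 = factor(c("A", "B", NA, "A", NA, "A", "B")),
  Col2 = factor(c(NA, NA, "C", "A", "B", NA, "B"))
)

# Work on the labels, not the integer codes, so differing levels are harmless.
c1 <- as.character(dat$Col1)
c2 <- as.character(dat$Col2)

# Take Col2 where Col1 is missing, then turn the result back into a factor.
dat$Col3 <- factor(ifelse(is.na(c1), c2, c1))
dat
```

Col3 ends up as a factor whose levels are the union of the values actually present in the two columns.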

Conditional replacement of NAs in two dataframes R

Probably simple but tricky question especially for larger data sets. Given two dataframes (df1,df2) of equal dimensions as below:
head(df1)
a b c
1 0.8569720 0.45839112 NA
2 0.7789126 0.36591578 NA
3 0.6901663 0.88095485 NA
4 0.7705756 0.54775807 NA
5 0.1743111 0.89087819 NA
6 0.5812786 0.04361905 NA
and
head(df2)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 1
3 0.08982958 0.4453491 2
4 0.75196925 0.6745908 3
5 0.73216793 0.6418483 4
6 0.73640209 0.7448011 5
How can one find all columns of df1 where all(is.na(...)) is TRUE (in this case c), then go to df2 and set all values in the matching column (c) to NA?
Desired output
head(df3)
a b c
1 0.21210312 0.7670091 NA
2 0.19767464 0.3050934 NA
3 0.08982958 0.4453491 NA
4 0.75196925 0.6745908 NA
5 0.73216793 0.6418483 NA
6 0.73640209 0.7448011 NA
My actual dataframes have more than 140000 columns.
We can use colSums on the negated logical matrix (!is.na(df1)), then negate (!) the resulting vector so that a count of 0 non-NA elements becomes TRUE and all others FALSE, use this to subset the columns of 'df2', and assign NA to them.
df2[!colSums(!is.na(df1))] <- NA
df2
# a b c
#1 0.21210312 0.7670091 NA
#2 0.19767464 0.3050934 NA
#3 0.08982958 0.4453491 NA
#4 0.75196925 0.6745908 NA
#5 0.73216793 0.6418483 NA
#6 0.73640209 0.7448011 NA
Or another option is to loop over the columns and check whether all the elements are NA to create a logical vector for subsetting the columns of 'df2' and assigning it to NA
df2[sapply(df1, function(x) all(is.na(x)))] <- NA
If these are big datasets, another option would be set from data.table (should be more efficient as this does the assignment in place)
library(data.table)
setDT(df2)
j1 <- which(sapply(df1, function(x) all(is.na(x))))
for (j in j1) {
  set(df2, i = NULL, j = j, value = NA)
}
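For a quick check of the base R approach, here is a self-contained sketch on made-up two-row data frames (the values are illustrative, not the ones from the question). It uses vapply instead of sapply so the result is guaranteed to be a logical vector, and writes into a copy named df3 to match the desired output.

```r
# Made-up stand-ins for df1 and df2 with the same shape.
df1 <- data.frame(a = c(0.1, 0.2), b = c(0.3, 0.4), c = c(NA, NA))
df2 <- data.frame(a = c(0.5, 0.6), b = c(0.7, 0.8), c = c(NA, 1))

# TRUE for each column of df1 that is entirely NA.
all_na <- vapply(df1, function(x) all(is.na(x)), logical(1))

# Blank out the matching columns in a copy of df2.
df3 <- df2
df3[all_na] <- NA
df3
```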

Delete column with NAs in first row

If I have a dataframe like so
a <- c(NA,1,2,NA,4)
b <- c(6,7,8,9,10)
c <- c(NA,12,13,14,15)
d <- c(16,NA,18,NA,20)
df <- data.frame(a,b,c,d)
How can I delete columns "a" and "c" by asking R to delete those columns that contain an NA in the first row?
My actual dataset is much bigger, and this is only by way of a reproducible example.
Please note that this isn't the same as asking to delete columns with any NAs in it. My columns may have other NA values in it. I'm looking to delete just the ones with an NA in the first row.
You can use a vector of booleans indicating whether the first-row value is missing in each column.
res <- df[,!is.na(df[1,])]
> res
b d
1 6 16
2 7 NA
3 8 18
4 9 NA
5 10 20
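An equivalent sketch that builds the logical vector column by column, which may be easier to read and keeps a data frame even if only one column survives (drop = FALSE). The name keep is illustrative.

```r
a <- c(NA, 1, 2, NA, 4)
b <- c(6, 7, 8, 9, 10)
c <- c(NA, 12, 13, 14, 15)
d <- c(16, NA, 18, NA, 20)
df <- data.frame(a, b, c, d)

# TRUE for columns whose first element is not NA; later NAs are ignored.
keep <- !vapply(df, function(col) is.na(col[1]), logical(1))

res <- df[, keep, drop = FALSE]
names(res)  # "b" "d"
```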

merge.data.table with all=True introduces NA row. Is this correct?

Doing a merge between a populated data.table and another one that is empty introduces one NA row in the resulting data.table:
a = data.table(c=c(1,2),key='c')
b = data.table(c=3,key='c')
b=b[c!=3]
b
# Empty data.table (0 rows) of 1 col: c
merge(a,b,all=T)
# c
# 1: NA
# 2: 1
# 3: 2
Why? I expected that it would return only the rows of data.table a, as it does with merge.data.frame:
> merge.data.frame(a,b,all=T,by='c')
# c
#1 1
#2 2
The example in the question is far too simple to show the problem, hence the confusion and discussion. Using two one-column data.tables isn't enough to show what merge does!
Here's a better example :
> a = data.table(P=1:2,Q=3:4,key='P')
> b = data.table(P=2:3,R=5:6,key='P')
> a
P Q
1: 1 3
2: 2 4
> b
P R
1: 2 5
2: 3 6
> merge(a,b) # correct
P Q R
1: 2 4 5
> merge(a,b,all=TRUE) # correct.
P Q R
1: 1 3 NA
2: 2 4 5
3: 3 NA 6
> merge(a,b[0],all=TRUE) # incorrect result when y is empty, agreed
P Q R
1: NA NA NA
2: NA NA NA
3: 1 3 NA
4: 2 4 NA
> merge.data.frame(a,b[0],all=TRUE) # correct
P Q R
1 1 3 NA
2 2 4 NA
Ricardo got to the bottom of this and fixed it in v1.8.9. From NEWS :
merge no longer returns spurious NA row(s) when y is empty and
all.y=TRUE (or all=TRUE), #2633. Thanks
to Vinicius Almendra for reporting. Test added.
all : logical; all = TRUE is shorthand to save setting both all.x = TRUE and all.y = TRUE.
all.x : logical; if TRUE, then extra rows will be added to the output, one for each row in
x that has no matching row in y. These rows will have ’NA’s in those columns
that are usually filled with values from y. The default is FALSE, so that only rows
with data from both x and y are included in the output.
all.y : logical; analogous to all.x above.
This is taken from data.table documentation. For more, look at the description of the arguments for merge function there.
I think this answers your question.
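The all / all.x / all.y behaviour described above can be tried out with base R's merge on plain data frames, which has the same arguments. This is only a sketch of the expected full outer join semantics, including the empty-y case from the question; it does not use data.table, and the names full and empty_y are illustrative.

```r
a <- data.frame(P = 1:2, Q = 3:4)
b <- data.frame(P = 2:3, R = 5:6)

# Full outer join: all keys from both tables, NA where there is no match.
full <- merge(a, b, by = "P", all = TRUE)

# Same join against a zero-row version of b: only the rows of a come back,
# with the y-side column R filled with NA.
empty_y <- merge(a, b[0, ], by = "P", all = TRUE)

full
empty_y
```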
Given a and b as you defined them, simply using rbind(a,b) would return only the rows of a.
However, if you want to merge an empty data.table b with some other non-empty data.table a, there is a different approach. I had a similar problem when I had to merge different data.tables within different loops, and I used this workaround:
#some loop that returns data.table named a
#another loop starts
# all.equal() returns a character vector when the objects differ, so wrap it in isTRUE()
if (isTRUE(all.equal(a, b <- data.table()))) {
  b <- a
  next
}
merge(a,b,c("Factor1","Factor2"))
That helped me, maybe it will help you too.
That's to be expected: for merge.data.frame, all=T is a full outer join, so you get all keys of both tables (see ?merge).
