when to use na.omit versus complete.cases - r

I have the following code comparing na.omit and complete.cases:
> mydf
AA BB
1 2 2
2 NA 5
3 6 8
4 5 NA
5 9 6
6 NA 1
>
>
> na.omit(mydf)
AA BB
1 2 2
3 6 8
5 9 6
>
> mydf[complete.cases(mydf),]
AA BB
1 2 2
3 6 8
5 9 6
>
> str(na.omit(mydf))
'data.frame': 3 obs. of 2 variables:
$ AA: int 2 6 9
$ BB: int 2 8 6
- attr(*, "na.action")=Class 'omit' Named int [1:3] 2 4 6
.. ..- attr(*, "names")= chr [1:3] "2" "4" "6"
>
>
> str(mydf[complete.cases(mydf),])
'data.frame': 3 obs. of 2 variables:
$ AA: int 2 6 9
$ BB: int 2 8 6
>
> identical(na.omit(mydf), mydf[complete.cases(mydf),])
[1] FALSE
Are there situations where one should be preferred over the other, or are they effectively the same?

It is true that na.omit and complete.cases are functionally the same when complete.cases is applied to all columns of your object (e.g. data.frame):
R> all.equal(na.omit(mydf),mydf[complete.cases(mydf),],check.attributes=F)
[1] TRUE
But I see two fundamental differences between these two functions (there may well be others). First, na.omit adds an na.action attribute to the object, recording which rows were dropped because of missing values. I imagine a trivial use case for this as something like:
foo <- function(data) {
data <- na.omit(data)
# the na.action attribute holds the indices of the removed rows
n <- length(attr(data, "na.action"))
message(sprintf("Note: %i rows removed due to missing values.", n))
# do something with data
}
##
R> foo(mydf)
Note: 3 rows removed due to missing values.
where we provide the user with some relevant information. I'm sure a more creative person could find (and probably has found) better uses of the na.action attribute, but you get the point.
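As a minimal sketch of that first difference (mydf is rebuilt by hand from the printout in the question, so its construction is an assumption), the na.action attribute can be read directly, and dropping it should remove the only remaining difference between the two results:

```r
# mydf rebuilt from the printed output in the question (assumed construction)
mydf <- data.frame(AA = c(2L, NA, 6L, 5L, 9L, NA),
                   BB = c(2L, 5L, 8L, NA, 6L, 1L))

omitted  <- na.omit(mydf)
filtered <- mydf[complete.cases(mydf), ]

# Row indices that na.omit dropped, stored in the na.action attribute
removed <- as.integer(attr(omitted, "na.action"))
print(removed)

# Strip the attribute and compare again
attr(omitted, "na.action") <- NULL
print(identical(omitted, filtered))
```

This is why identical() returned FALSE in the question while all.equal(..., check.attributes = FALSE) returned TRUE.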
Second, complete.cases allows for partial manipulation of missing values, e.g.
R> mydf[complete.cases(mydf[,1]),]
AA BB
1 2 2
3 6 8
4 5 NA
5 9 6
Depending on what your variables represent, you may feel comfortable imputing values for column BB, but not for column AA, so using complete.cases like this allows you finer control.
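The same idea extends to several columns at once. The sketch below uses a made-up data frame (df3 and its column names are hypothetical, not from the question) and keeps the rows that are complete in the two columns of interest, regardless of the third:

```r
# Hypothetical example data: 'note' is allowed to be missing
df3 <- data.frame(x    = c(1, NA, 3, 4),
                  y    = c(10, 20, NA, 40),
                  note = c("a", NA, "c", NA))

# TRUE where both x and y are present; 'note' is ignored
keep <- complete.cases(df3[, c("x", "y")])
subset_xy <- df3[keep, ]
print(subset_xy)

# na.omit has no such option: it drops any row with an NA in any column
print(nrow(na.omit(df3)))
```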

Related

what is the difference between df[1] and df[,1] (in dataframes)

I've noticed they give the same result, except that df[1] returns the column as a data frame while df[,1] returns a vector. Also, I've noticed they give exactly the same result in tibbles. Is that all there is to it?
The "[" function has (at least) two different forms. When used on a data frame (which is a special form of list) with two arguments, it returns the contents of the specified rows and columns. It has an optional argument "drop" whose default is TRUE; if it is set to FALSE, you get the subset as a data frame. When used with one argument, it returns the selected columns without loss of the "data.frame" class attribute, because the result is still a list of columns.
The other extraction function, "[[" also returns the contents only.
dat <- data.frame(A=1:10,B=letters[1:10])
> str(dat[1:5,])
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 3 4 5
$ B: Factor w/ 10 levels "a","b","c","d",..: 1 2 3 4 5
> str(dat[1:5,1])
int [1:5] 1 2 3 4 5
> str(dat[1])
'data.frame': 10 obs. of 1 variable:
$ A: int 1 2 3 4 5 6 7 8 9 10
> str(dat[[1]])
int [1:10] 1 2 3 4 5 6 7 8 9 10
> str(dat[,1,drop=FALSE])
'data.frame': 10 obs. of 1 variable:
$ A: int 1 2 3 4 5 6 7 8 9 10
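The distinctions above can also be checked programmatically rather than by reading str output. This is a small sketch in base R only; the variable names one_arg, two_args and contents are illustrative:

```r
dat <- data.frame(A = 1:10, B = letters[1:10])

one_arg  <- dat[1]      # list-style subsetting: keeps the data.frame class
two_args <- dat[, 1]    # matrix-style subsetting: drop = TRUE gives a vector
contents <- dat[[1]]    # extracts the column itself

print(is.data.frame(one_arg))
print(is.data.frame(two_args))

# dat[, 1] and dat[[1]] return the same vector
print(identical(two_args, contents))

# dat[1] and dat[, 1, drop = FALSE] are equivalent data frames
print(isTRUE(all.equal(one_arg, dat[, 1, drop = FALSE])))
```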

rename a matrix column which has no initial names with dplyr

I'm trying to rename the columns of a matrix that has no names in dplyr :
set.seed(1234)
v1 <- table(round(runif(50,0,10)))
v2 <- table(round(runif(50,0,10)))
library(dplyr)
bind_rows(v1,v2) %>%
t
[,1] [,2]
0 3 4
1 1 9
2 8 6
3 11 7
5 7 8
6 7 1
7 3 4
8 6 3
9 3 6
10 1 NA
4 NA 2
I usually use rename for that with the form rename(new_name=old_name) however because there is no old_name it doesn't work. I've tried:
rename("v1","v2")
rename(c("v1","v2")
rename(v1=1, v2=2)
rename(v1=[,1],v2=[,v2])
rename(v1="[,1]",v2="[,v2]")
rename_(.dots = c("v1","v2"))
setNames(c("v1","v2"))
none of these works.
I know the base R way to do it (colnames(obj) <- c("v1","v2")) but I'm specifically looking for a dplyr way to do it.
Here is one way, using magrittr:
library(dplyr)
bind_rows(v1,v2) %>%
t %>%
magrittr::set_colnames(c("new1", "new2"))
In order to use rename you need to have some sort of a list (like a data frame or a tibble). So you can do two things. You either convert to tibble and use rename or use colnames and leave the structure as is, i.e.
new_d <- bind_rows(v1,v2) %>%
t() %>%
as.tibble() %>%
rename('A' = 'V1', 'B' = 'V2')
#where
str(new_d)
#Classes ‘tbl_df’, ‘tbl’ and 'data.frame': 11 obs. of 2 variables:
# $ A: int 3 1 8 11 7 7 3 6 3 1 ...
# $ B: int 4 9 6 7 8 1 4 3 6 NA ...
Or
new_d1 <- bind_rows(v1,v2) %>%
t() %>%
`colnames<-`(c('A', 'B'))
#where
str(new_d1)
# int [1:11, 1:2] 3 1 8 11 7 7 3 6 3 1 ...
# - attr(*, "dimnames")=List of 2
# ..$ : chr [1:11] "0" "1" "2" "3" ...
# ..$ : chr [1:2] "A" "B"

Aggregate command in R to combine rows based on unique ID - output data structure?

I'm sure there's a super-easy answer to this. I am trying to combine ratings on subjects based on their unique ID. Here is a test dataset (called Aggregate_Test) I created, where the ID is unique to the subject, and the StaticScore was done by different raters:
ID StaticScore
1 6
2 7
1 5
2 6
3 7
4 8
3 4
4 5
After reading other posts carefully, I used aggregate to create the following dataset with new columns:
StaticAggregate<-aggregate(StaticScore ~ ID, Aggregate_Test, c)
> StaticAggregate
ID StaticScore.1 StaticScore.2
1 1 6 5
2 2 7 6
3 3 7 4
4 4 8 5
This data frame has the following str:
> str(StaticAggregate)
'data.frame': 4 obs. of 2 variables:
$ ID : num 1 2 3 4
$ StaticScore: num [1:4, 1:2] 6 7 7 8 5 6 4 5
If I try to create a new variable by subtracting StaticScore.1 from StaticScore.2, I get the following error:
Staticdiff<-StaticScore.1-StaticScore.2
Error: object 'StaticScore.1' not found
So, please help me - what is this data structure created by aggregate? A matrix? How could I convert StaticScore.1 and StaticScore.2 to separate variables, or barring that, what is the notation to subtract one from the other to create a new variable?
We can do a dcast to create a wide format from long and subtract those columns to create the 'StaticDiff'
library(data.table)
dcast(setDT(Aggregate_Test), ID ~ paste0("StaticScore", rowid(ID)),
value.var = "StaticScore")[, StaticDiff := StaticScore1 - StaticScore2]
Regarding the specific question about the aggregate behavior, we are just concatenating (c) the 'StaticScore' by 'ID'. The default behavior is to create a matrix column in aggregate
StaticAggregate<-aggregate(StaticScore ~ ID, Aggregate_Test, c)
This can be checked by looking at the str(StaticAggregate)
str(StaticAggregate)
#'data.frame': 4 obs. of 2 variables:
#$ ID : int 1 2 3 4
#$ StaticScore: int [1:4, 1:2] 6 7 7 8 5 6 4 5
How do we change it to normal columns?
It can be done with do.call(data.frame, ...):
StaticAggregate <- do.call(data.frame, StaticAggregate)
Check the str again
str(StaticAggregate)
#'data.frame': 4 obs. of 3 variables:
# $ ID : int 1 2 3 4
# $ StaticScore.1: int 6 7 7 8
# $ StaticScore.2: int 5 6 4 5
Now, we can do the calculation as shown in the OP's post
StaticAggregate$Staticdiff <- with(StaticAggregate, StaticScore.1-StaticScore.2)
StaticAggregate
# ID StaticScore.1 StaticScore.2 Staticdiff
#1 1 6 5 1
#2 2 7 6 1
#3 3 7 4 3
#4 4 8 5 3
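For anyone who wants to reproduce this without the original file, here is a self-contained sketch in base R. Aggregate_Test is rebuilt by hand from the table in the question, so the column types are an assumption. Note that the matrix column only arises because every ID has the same number of scores:

```r
# Test data rebuilt from the question (assumed to be numeric columns)
Aggregate_Test <- data.frame(ID          = c(1, 2, 1, 2, 3, 4, 3, 4),
                             StaticScore = c(6, 7, 5, 6, 7, 8, 4, 5))

StaticAggregate <- aggregate(StaticScore ~ ID, Aggregate_Test, c)

# Two columns only: ID and a matrix-valued StaticScore
print(names(StaticAggregate))
print(is.matrix(StaticAggregate$StaticScore))

# Flatten the matrix column into ordinary columns
flat <- do.call(data.frame, StaticAggregate)
print(names(flat))

flat$Staticdiff <- flat$StaticScore.1 - flat$StaticScore.2
print(flat$Staticdiff)
```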
As the str output shown in the question indicates, StaticAggregate is a two column data.frame whose second column is a two column matrix, StaticScore. We can display the matrix like this:
StaticAggregate$StaticScore
## [,1] [,2]
## [1,] 6 5
## [2,] 7 6
## [3,] 7 4
## [4,] 8 5
To create a new column with the difference:
transform(StaticAggregate, diff = StaticScore[, 1] - StaticScore[, 2])
## ID StaticScore.1 StaticScore.2 diff
## 1 1 6 5 1
## 2 2 7 6 1
## 3 3 7 4 3
## 4 4 8 5 3
Note that there are no columns in StaticAggregate or in StaticAggregate$StaticScore named StaticScore.1 and StaticScore.2. StaticScore.1 in the heading of the data.frame print output just denotes the first column of the StaticScore matrix.
The reason that the matrix has no column names is that the aggregate function c does not produce them. If we change the original aggregate to this then they would have names:
StaticAggregate2 <- aggregate(StaticScore ~ ID, Aggregate_Test, setNames, c("A", "B"))
StaticAggregate2
## ID StaticScore.A StaticScore.B
## 1 1 6 5
## 2 2 7 6
## 3 3 7 4
## 4 4 8 5
Now we can write this using the column names of the matrix:
StaticAggregate2$StaticScore[, "A"]
## [1] 6 7 7 8
StaticAggregate2$StaticScore[, "B"]
## [1] 5 6 4 5
Note that there is a significant advantage of the way R's aggregate works as it allows simpler access to the results -- the kth column of the matrix is the kth result of the aggregate function. This is in contrast to having the k+1st column of the data.frame representing the kth result of the aggregate function. This may not seem like much of a simplification here but for more complex problems it can be a significant simplification if you need to access the statistics matrix. Of course, you can always flatten it to 3 columns if you want
do.call(data.frame, StaticAggregate)
but once you think about it for a while you may find that the structure it provides is actually more convenient.

the issues of column type in read.table result and their transformation

I read a csv file as follows
dataBu<-read.table("data1.csv",sep=",",header=T,stringsAsFactors=FALSE)
The data look as follows
id q1 q2 q3 q4
AB 1 1 0 1
AJ 0 2 3 0
AM 5 4 2 0
RA 2 1 10 0
BS 5 0 0 1
Then I would like to keep the last four columns, thus I have
dataBu1<-dataBu[,2:5]
But when I check the data, I found
> dataBu1[1,1]
[1] 1
> dataBu1[1,2]
[1] "1"
The first column and the second column are of different types: the first is numeric and the second is character. I assumed both of them would be numeric, but that turns out not to be the case. What causes this, and how can I convert the second column to numeric?
If you have character values in a column that should be numeric, one way to convert it back to numeric is to use as.numeric
set.seed(42)
dataBu1 <- data.frame(q1=sample(1:10,20,replace=TRUE),
q2=sample(c('', 5:15,'q2'),20,replace=TRUE), stringsAsFactors=FALSE)
as.numeric(dataBu1[,2]) #replace all the character values with NA but it issues a warning message
#[1] 14 5 15 15 NA 10 8 14 9 14 12 13 8 12 NA 13 NA 6 14 11
#Warning message:
#NAs introduced by coercion
For multiple columns (assuming that you read the dataset with stringsAsFactors=FALSE)
dataBu1[] <-lapply(dataBu1, as.numeric)
str(dataBu1)
#'data.frame': 20 obs. of 2 variables:
#$ q1: num 10 10 3 9 7 6 8 2 7 8 ...
#$ q2: num 15 5 NA NA 5 10 9 15 9 14 ...
Or without getting the warning message
dataBu1[] <- lapply(dataBu1, function(x)
as.numeric(replace(x, !grepl("^[0-9]+$", x), NA)))
Update
I guess you are asking to find the index of non-numeric elements after it was read using read.table.
lapply(dataBu1, function(x) which(!grepl("^[0-9]+$", x)))
#$q1
#integer(0)
#$q2
#[1] 3 4 15 17
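As for what causes the mismatch in the first place, a likely explanation is that at least one entry in the column cannot be parsed as a number, so read.table reads the whole column as character. The sketch below illustrates this with inline text instead of a file; the stray "x" value is a made-up example, not taken from the question's data:

```r
# Inline stand-in for data1.csv; the "x" in q2 is a hypothetical bad entry
raw <- "id,q1,q2
AB,1,1
AJ,0,2
AM,5,x
RA,2,1"

d <- read.table(text = raw, sep = ",", header = TRUE, stringsAsFactors = FALSE)

# One non-numeric entry is enough to make the whole column character
col_classes <- sapply(d, class)
print(col_classes)

# Locate the entries that fail to convert (and were not already missing)
converted <- suppressWarnings(as.numeric(d$q2))
bad_rows  <- which(is.na(converted) & !is.na(d$q2))
print(d$q2[bad_rows])
```

Inspecting the offending entries before coercing them to NA helps decide whether they are typos to correct or genuinely missing values.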

Subset columns based on certain columns missing value

My dataset is pretty big. I have about 2,000 variables and 1,000 observations.
I want to run a model for each variable using other variables.
To do so, I need to drop the variables that have missing values in rows where the dependent variable does not.
For instance, for variable "A" I need to drop variables C and D, because they have missing values in rows where A does not. For variable "C" I can keep variable "D".
data <- read.table(text="
A B C D
1 3 9 4
2 1 3 4
NA NA 3 5
4 2 NA NA
2 5 4 3
1 1 1 2",header=T,sep="")
I think I need to make a loop to go through each variable.
I think this gets what you need:
for (i in seq_len(ncol(data))) {
# keep only the rows where column 'i', the column
# we currently care about, is not NA
tmp <- data[!is.na(data[[i]]), , drop = FALSE]
# column 'i' now has no NA values, so drop every other
# column that still contains an NA
tmp <- tmp[sapply(tmp, function(x) !any(is.na(x)))]
# run your model on 'tmp'
}
For each iteration of i, the tmp data frame looks like:
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 4 2 1
$ B: int 3 1 2 5 1
'data.frame': 5 obs. of 2 variables:
$ A: int 1 2 4 2 1
$ B: int 3 1 2 5 1
'data.frame': 5 obs. of 2 variables:
$ C: int 9 3 3 4 1
$ D: int 4 4 5 3 2
'data.frame': 5 obs. of 2 variables:
$ C: int 9 3 3 4 1
$ D: int 4 4 5 3 2
I'll provide a way to get the usable variables for each column you choose:
getVars <- function(data, col){
tmp<-!sapply(data[!is.na(data[[col]]),], function(x) { any(is.na(x)) })
names(data)[tmp & names(data) != col]
}
PS: I'm on my phone so I didn't test the above nor had the chance for a good code styling.
EDIT: Styling fixed!
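A possible way to use this helper on the sample data from the question is sketched below. The function is restated in a slightly expanded form so the block runs on its own, and building a formula with reformulate is only one assumed way of feeding the result into a model:

```r
data <- read.table(text = "
A B C D
1 3 9 4
2 1 3 4
NA NA 3 5
4 2 NA NA
2 5 4 3
1 1 1 2", header = TRUE)

# Same logic as getVars above, split into two steps for readability
getVars <- function(data, col) {
  rows     <- !is.na(data[[col]])   # rows where the chosen column is observed
  complete <- !sapply(data[rows, , drop = FALSE], function(x) any(is.na(x)))
  names(data)[complete & names(data) != col]
}

# Usable predictors for every column, as a named list
usable <- lapply(setNames(names(data), names(data)), getVars, data = data)
print(usable)

# Assumed follow-up: turn the result for "A" into a model formula
f <- reformulate(usable[["A"]], response = "A")
print(f)
```

If a column has no usable predictors, getVars returns character(0) and reformulate will fail, so that case needs handling before fitting.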
