I am a grad student using R and have been reading the other Stack Overflow answers regarding removing rows that contain NA from dataframes. I have tried both na.omit and complete.cases. When using both it shows that the rows with NA have been removed, but when I write summary(data.frame) it still includes the NAs. Are the rows with NA actually removed or am I doing this wrong?
na.omit(Perios)
summary(Perios)
Perios[complete.cases(Perios),]
summary(Perios)
The problem is that you never assigned the output of na.omit back to the data frame. na.omit returns a new object and leaves Perios unchanged:
Perios <- na.omit(Perios)
If you know which column the NAs occur in, then you can just do
Perios[!is.na(Perios$Periostitis),]
or more generally:
Perios[!is.na(Perios$colA) & !is.na(Perios$colD) & ... ,]
Then, as a general safety tip for R, add an na.fail call to assert that it worked (na.fail raises an error if any NA remains):
na.fail(Perios) # trust, but verify! Paranoia is healthy.
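As a minimal sketch of the point above, the following uses a small made-up data frame (the id column and the values are assumptions for illustration, not the asker's data) to show that na.omit leaves the original untouched until the result is assigned:

```r
# Hypothetical stand-in for the asker's Perios data frame
Perios <- data.frame(id = 1:4, Periostitis = c(0, NA, 1, NA))

# Calling na.omit without assignment only returns a cleaned copy
tmp <- na.omit(Perios)
still_has_na <- any(is.na(Perios))      # TRUE: Perios itself is unchanged

# Assigning the result is what keeps the filtered rows
Perios_clean <- na.omit(Perios)
clean_has_na <- any(is.na(Perios_clean))  # FALSE
rows_left <- nrow(Perios_clean)           # 2

# na.fail returns its argument when no NA is present, and errors otherwise
checked <- na.fail(Perios_clean)
```

Writing the result to a new name such as Perios_clean, rather than overwriting Perios, keeps the raw data available if you need to re-check it later.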
is.na is not the right function here. You want complete.cases, which is equivalent to function(x) apply(!is.na(x), 1, all), or na.omit, to filter the data.
That is, you want all rows where there are no NA values:
> x <- data.frame(a=c(1,2,NA), b=c(3,NA,NA))
> x
   a  b
1  1  3
2  2 NA
3 NA NA
> x[complete.cases(x),]
  a b
1 1 3
> na.omit(x)
  a b
1 1 3
Assign the result back to x (for example, x <- na.omit(x)) to keep the filtered data.
complete.cases returns a vector, one element per row of the input data frame. On the other hand, is.na returns a matrix. This is not appropriate for returning complete cases, but can return all non-NA values as a vector:
> is.na(x)
         a     b
[1,] FALSE FALSE
[2,] FALSE  TRUE
[3,]  TRUE  TRUE
> x[!is.na(x)]
[1] 1 2 3
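To make the relationship between the two functions concrete, here is a small sketch that rebuilds the row filter from is.na and checks it against complete.cases. The variable names keep_cc, keep_apply and keep_sums are only illustrative:

```r
x <- data.frame(a = c(1, 2, NA), b = c(3, NA, NA))

# One logical value per row: TRUE when the row has no NA
keep_cc <- complete.cases(x)

# The same test built from the is.na matrix, row by row
keep_apply <- unname(apply(!is.na(x), 1, all))

# A vectorised variant: count the NAs in each row
keep_sums <- unname(rowSums(is.na(x)) == 0)

# All three filters select the same rows
filtered <- x[keep_cc, ]
```

Only the first row of x survives, which matches the na.omit output shown above.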
I'd like to understand what's going on in this piece of R code I was testing. I'd like to replace part of a vector with another vector. The original and replacement values are in a data.frame. I'd like to replace all elements of the vector that match the original column with the corresponding replacement values. I have the answer to the larger question, but I'm unable to understand how it works.
Here's a simple example:
> vecA <- 1:5;
> vecB <- data.frame(orig=c(2,3), repl=c(22,33));
> vecA[vecA %in% vecB$orig] <- vecB$repl #Question-1
> vecA
[1] 1 22 33 4 5
> vecD<-data.frame(orig=c(5,7), repl=c(55,77))
> vecA[vecA %in% vecD$orig] <- vecD$repl #Question-2
Warning message:
In vecA[vecA %in% vecD$orig] <- vecD$repl :
number of items to replace is not a multiple of replacement length
> vecA
[1] 1 22 33 4 55
Here are my questions:
How does the assignment on Line-3 work? The index on the LHS is a 5-element logical vector, whereas the RHS is a 2-item vector.
Why does the assignment on Line-6 give a warning (but still work)?
The First Question
R goes through each element in vecA and checks whether it exists in vecB$orig. The %in% operator returns a logical vector with one element per element of vecA. If you run the command vecA %in% vecB$orig you get the following:
[1] FALSE TRUE TRUE FALSE FALSE
which is telling you that in the vector 1 2 3 4 5 it sees 2 and 3 in vecB$orig.
By subsetting vecA by this command, you are isolating only the TRUE values in vecA, so vecA[vecA %in% vecB$orig] returns:
[1] 2 3
The assignment then overwrites the elements of vecA at the positions where vecA %in% vecB$orig is TRUE with the values of vecB$repl, in order, which replaces 2 3 in vecA with 22 33.
The Second Question
In this case, the same logic applies for subsetting, but running vecA[vecA %in% vecD$orig] gives you
[1] 5
as 7 does not exist in vecA. You are trying to replace a vector of length 1 with a vector of length 2, which is what triggers the warning. In this case, R just uses the first element of vecD$repl, which happens to be 55, and ignores the rest.
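Note that the %in% assignment is positional: it pairs the selected elements with the replacement values in order, so it only gives the intended result when the matches happen to appear in the same order as the lookup table. A hedged sketch of a value-based alternative using match follows; the vector and the lookup table lut are made up for illustration:

```r
vecA <- c(3, 1, 2, 5)
lut  <- data.frame(orig = c(2, 3, 7), repl = c(22, 33, 77))  # hypothetical lookup table

# For each element of vecA, the row of lut it matches (NA when there is none)
idx <- match(vecA, lut$orig)
hit <- !is.na(idx)

# Replace only the matched elements, each with its own replacement value
vecA[hit] <- lut$repl[idx[hit]]
```

Here 3 becomes 33 and 2 becomes 22 even though they appear in the opposite order to lut$orig, and no recycling warning is raised for the unmatched 7.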
The following relates to the R language.
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- which(x1 == 7)
I don't understand this: the LHS in the last line gives you a logical vector, and the RHS is a value (the index where x1 == 7, which is 5 in this case). So what does it mean to assign a value of 5 to a logical vector?
is.na from the docs returns:
The default method for is.na applied to an atomic vector returns a logical vector of the same length as its argument x, containing TRUE for those elements marked NA or, for numeric or complex vectors, NaN, and FALSE otherwise.
Therefore, by assigning to is.na(x1) you are in essence saying that the elements the right-hand side points to should be NA.
The replacement form is.na(x) <- value works the other way round from is.na(x): value is used as an index into x, and the elements at those positions are set to NA. Here the index comes from which.
To put it in practice:
This is the output from is.na(x1):
is.na(x1)
[1] FALSE FALSE FALSE TRUE FALSE
The corresponding output from which(x1 == 7):
which(x1 == 7)
[1] 5
Combining the two, the element at position 5 becomes NA, so is.na() now returns TRUE for it:
is.na(x1) <- which(x1 == 7)
x1
[1] 1 4 3 NA NA
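Indices beyond the length of the vector extend it. Starting again from the original x1, assigning c(1,7) turns the first element into an NA and appends two more NAs so that index 7 exists and is an NA:
x1 <- c(1, 4, 3, NA, 7)
is.na(x1) <- c(1,7)
x1
[1] NA 4 3 NA 7 NA NA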
Compare with this example from the docs:
(xx <- c(0:4))
is.na(xx) <- c(2, 4)
xx
[1] 0 NA 2 NA 4
From the above, it is clear that c(2,4) refers to positions in xx, so the elements at those positions become NAs while the rest are unchanged.
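One way to convince yourself of this is to compare the replacement form with plain index assignment. The sketch below assumes the default method of the is.na replacement function, which for a vector behaves like x[value] <- NA; the copy y1 is only there for comparison:

```r
x1 <- c(1, 4, 3, NA, 7)
y1 <- x1

# which() drops the NA comparison and returns only position 5
pos <- which(x1 == 7)

# Replacement form: mark the elements at `pos` as NA
is.na(x1) <- pos

# Plain index assignment does the same thing
y1[pos] <- NA

same <- identical(x1, y1)
```

Both vectors end up as 1 4 3 NA NA, so the replacement form is mainly a more readable way to write the index assignment.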
I have the following matrices :
> matrix <- matrix(c(1,3,4,NA,NA,NA,3,0,4,6,0,NA,2,NA,NA,2,0,1,0,0), nrow=5,ncol=4)
> n <- matrix(c(1,2,5,6,2),nrow=5,ncol=1)
As you can see, for each row I have
multiple NAs (the number of NAs varies)
ONE single "0"
I would like to replace each 0 with the value of n for that row. Intended output below.
> output <- matrix(c(1, 3, 4,NA,NA,NA,3,5,4,6,1,NA,2,NA,NA,2,2,1,6,2), nrow=5,ncol=4)
I have tried the following
subset <- matrix == 0 & !is.na(matrix)
matrix[subset] <- n
#does not give intended output, but subset locates the values i want to change
When used on my "real" data i get the following message :
Warning message: In m[subset] <- n : number of items to replace is not
a multiple of replacement length
Thanks
EDIT: added a row to the matrix, as my real-life problem is with an unbalanced matrix. I am using matrices and not data frames here because I think (but am not sure) that with very large datasets R is quicker with large matrices than with subsets of data frames.
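We can do this with arithmetic: n[row(matrix)] picks the n value for each cell's row, and multiplying by (matrix==0) adds it only where the cell is 0 (NA cells stay NA):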
out1 <- matrix+n[row(matrix)]*(matrix==0)
identical(output, out1)
#[1] TRUE
It appears you want to replace the values by row, but a logical matrix index selects elements in column-major order, so the values of n are assigned down the columns. Transposing the matrix first gives the desired output (this relies on each row containing exactly one 0):
matrix <- t(matrix)
subset <- matrix == 0 & !is.na(matrix)
matrix[subset] <- n
matrix <- t(matrix)
setequal(output, matrix)
[1] TRUE
You can try this option with ifelse:
ifelse(matrix == 0, c(n) * (matrix == 0), matrix)
#      [,1] [,2] [,3] [,4]
# [1,]    1   NA    1    2
# [2,]    3    3   NA    2
# [3,]    4    5    2    1
# [4,]   NA    4   NA    6
# [5,]   NA    6   NA    2
zero = matrix == 0
identical(ifelse(zero, c(n) * zero, matrix), output)
# [1] TRUE
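Another way to sidestep the column-major ordering is to index by explicit (row, column) pairs. This is a sketch rather than one of the answers above: it uses which with arr.ind = TRUE, and the names m, pos and expected are chosen here to avoid masking the base functions matrix and subset:

```r
m <- matrix(c(1, 3, 4, NA, NA,  NA, 3, 0, 4, 6,
              0, NA, 2, NA, NA,  2, 0, 1, 0, 0), nrow = 5, ncol = 4)
n <- c(1, 2, 5, 6, 2)

# Row/column positions of the zeros; which() silently drops the NA comparisons
pos <- which(m == 0, arr.ind = TRUE)

# A two-column matrix index addresses individual cells,
# so each zero can be given the n value of its own row
m[pos] <- n[pos[, "row"]]

expected <- matrix(c(1, 3, 4, NA, NA,  NA, 3, 5, 4, 6,
                     1, NA, 2, NA, NA,  2, 2, 1, 6, 2), nrow = 5, ncol = 4)
```

Because each replacement is looked up by row, this version does not depend on every row having exactly one zero.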
So I know to determine the first occurrence of a specific element in each row you use the apply function with which.max or which.min. Here is the code that I am using right now.
x <- matrix(c(20,9,4,16,6,2,14,3,1),nrow=3)
x
apply(3 >= x,1,which.max )
This produces an output of:
[1] 1 3 2
Now when I try to do the same thing on a different matrix "x2"
x2 <- matrix(c(3,9,4,16,6,2,14,3,1),nrow=3)
x2
apply(3 >= x2,1,which.max )
The output is the same;
[1] 1 3 2
But for "x2" it is correct because the "x2" matrix's first row does have a value less than or equal to three.
Now my question, which is probably something simple, is why the apply calls produce the same thing for "x" and "x2". For "x" I would want something like:
[1] 0 3 2
Or maybe even something like this:
[1] NA 3 2
I have seen questions on stack overflow before on which.max not producing NAs and the answer was to just use the which() function, but since I am using a matrix and I want the first occurrence I do not have that luxury... I think.
We could replace the values in 'x' that are >3 with a very small number, e.g. -999, or any value lower than the minimum in the dataset, and then get the index with which.max. In the case of 'x', every value in the first row is greater than 3, so after the replacement which.max returns 1 for that row, but we would prefer NA or 0. To handle this, multiply the index by a logical flag: sum(x1>0) is 0 for the first row, negating it (!) gives TRUE, and negating once more gives FALSE. Multiplying by the logical coerces it to binary (0/1), so we get 0 for the first row. This assumes the values that are kept are positive.
apply(x, 1, function(x) {x1 <- ifelse(x>3, -999, x)
which.max(x1)*(!!sum(x1>0))})
#[1] 0 3 2
apply(x2, 1, function(x) {x1 <- ifelse(x>3, -999, x)
which.max(x1)*(!!sum(x1>0))})
#[1] 1 3 2
Another option is using max.col
x1 <- replace(x, which(x>3), -999)
max.col(x1)*!!rowSums(x1>0)
#[1] 0 3 2
x2N <- replace(x2, which(x2>3), -999)
max.col(x2N)*!!rowSums(x2N>0)
#[1] 1 3 2
Or a slight modification would be
indx <- x*(x <=3)
max.col(indx)*!!rowSums(indx)
#[1] 0 3 2
Put a column in front of '(3>=x)' that is Infinity, if and only if all entries in the corresponding row of 'x' are larger than 3, and otherwise NaN. Then apply 'which.max' rowwise, and finally subtract 1, because of the extra column:
x <- matrix(c(20,9,4,16,6,2,14,3,1),nrow=3)
a <- (!apply(3>=x,1,max))*Inf
apply( cbind(a,3>=x), 1, which.max ) - 1
This gives '0,3,2'. Here 'which.max' is applied to the extended matrix:
> cbind(a,3>=x)
a
[1,] Inf 0 0 0
[2,] NaN 0 0 1
[3,] NaN 0 1 1
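The which.max based approaches above rely on the structure of this particular data. As a hedged alternative, which can still be used row by row inside apply, returning NA when a row has no match; first_leq is a hypothetical helper name, not a base R function:

```r
# Index of the first element in each row that is <= limit, or NA if there is none
first_leq <- function(m, limit) {
  apply(m, 1, function(r) {
    hits <- which(r <= limit)
    if (length(hits) > 0) hits[1] else NA_integer_
  })
}

x  <- matrix(c(20, 9, 4, 16, 6, 2, 14, 3, 1), nrow = 3)
x2 <- matrix(c(3, 9, 4, 16, 6, 2, 14, 3, 1), nrow = 3)

res_x  <- first_leq(x, 3)   # NA 3 2
res_x2 <- first_leq(x2, 3)  # 1 3 2
```

Since which returns positions in increasing order, hits[1] is always the first occurrence, regardless of the size of the values in the row.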
I am running a loop in R to find indices of a vector when its elements are equal to elements of a reference vector.
As far as I know R, I need to declare the variable before the for-loop, but in this case I do not know the final length of my indices vector (see code below).
How can I create a variable that R can resize during the for loop?
extract of my code:
k <- 1
for(i in 1:length(Lid.time)){
ind <- which(Net.time==Lid.time[i])
if(length(ind)>0){
ind.Net[k] <- ind
k <- k+1
}
}
Notes about the code:
Lid.time is a vector of a different length than Net.time.
I need to find an array of indices that tells me where Net.time is equal to Lid.time. I do not know in advance how long the ind.Net vector will be, so how can I declare the vector ind.Net?
Thanks for your help
As Dason stated, match will work just fine for that specific task:
>a <- seq(2,20,2)
>a
#[1] 2 4 6 8 10 12 14 16 18 20
>b <- c(4,14,18)
>match(b,a)
#[1] 2 7 9 # The indices!
>a %in% b #shorthand logical version of match
#[1] FALSE TRUE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE
But to answer your question of a vector of unknown length within a loop:
Vector <- c()
for(i in sample(1:100,20)) {
if(i<50) {Vector <- append(Vector, i)}
}
length(Vector)
It will be different every time you run it because of sample.
No need for a loop as it sounds like match does what you want.
a <- 1:10
b <- c(2, 7, 9)
match(a, b)
# [1] NA 1 NA NA NA NA 2 NA 3 NA
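Applied to the question's own variable names, a minimal sketch might look like the following. The values of Net.time and Lid.time are invented for illustration, and the loop version is shown only to answer the "unknown length" part of the question:

```r
Net.time <- c(10, 20, 30, 40, 50)   # hypothetical reference vector
Lid.time <- c(30, 99, 10)           # hypothetical values to look up

# Loop version: start from an empty vector and append matches as they are found
ind.loop <- integer(0)
for (lt in Lid.time) {
  ind <- which(Net.time == lt)
  ind.loop <- c(ind.loop, ind)      # appending an empty result changes nothing
}

# Vectorised version: position of each Lid.time value in Net.time, NA if absent
ind.Net <- match(Lid.time, Net.time)
ind.Net <- ind.Net[!is.na(ind.Net)]
```

Note the argument order: match(Lid.time, Net.time) returns positions in Net.time, and it reports only the first match for each value, so the two versions agree only when Net.time has no duplicates.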