Delete a row only if it is entirely NA, in R

My dataset looks something like this, and I want to delete a row only if every value in it is NA, not if it has at least one value.
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] NA NA NA
[5,] 4 8 NA
In an example I found, they were able to delete exactly what I want, but when I try it in the exact same way, it doesn't work.
I've already tried their code:
data[rowSums(is.na(data)) != ncol(data),]
But my number of rows doesn't change the way it does in their result:
[,1] [,2] [,3]
[1,] 1 2 3
[2,] 1 NA 4
[3,] 4 6 7
[4,] 4 8 NA
My NAs are not character strings. If I ask for their class:
class(NA)
[1] "logical"
Do you know another way to do this, please?
______UPDATE_____
Maybe I explained it badly.
Here is my problem, and why their code is not working:
mymat[rowSums(is.na(mymat)) != ncol(mymat), ]
I have 3 columns with information, but everything after them is NA, like this:
Date Product Code protein fat
2016-01-01 aaa 0001 NA NA
2016-01-01 bbb 0003 NA NA
2016-02-01 ccc 0032 NA NA
So the row is not entirely NA, only the part after the 3rd column. But I still want to remove the entire row (columns 1:5).
Thank you!

First, I would coerce the matrix to a data frame, because this is the typical ("tidy") format for storing variables and observations. Then you could use the remove_empty_rows() function from the sjmisc package:
library(sjmisc)
df <- data.frame(
a = c(1, 1, 4, NA, 4),
b = c(2, NA, 6, NA, 8),
c = c(3, 4, 7, NA, NA)
)
# get row numbers of empty rows
empty_rows(df)
## [1] 4
# remove empty rows
remove_empty_rows(df)
## A tibble: 4 × 3
## a b c
## * <dbl> <dbl> <dbl>
## 1 1 2 3
## 2 1 NA 4
## 3 4 6 7
## 4 4 8 NA
There are also functions for columns: empty_cols() and remove_empty_cols().
If you just want to keep complete cases (rows), use complete.cases():
df[complete.cases(df), ]
## a b c
## 1 1 2 3
## 3 4 6 7
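If you would rather not depend on sjmisc, the same all-NA row filter can be sketched in base R. This is only an illustration that reuses the df from above; the names all_na, empty_idx and df_clean are made up here, and this is not a claim about how remove_empty_rows() is implemented.

```r
df <- data.frame(
  a = c(1, 1, 4, NA, 4),
  b = c(2, NA, 6, NA, 8),
  c = c(3, 4, 7, NA, NA)
)

# A row counts as "empty" when every cell in it is NA
all_na <- rowSums(is.na(df)) == ncol(df)

# Row numbers of the empty rows
empty_idx <- unname(which(all_na))

# Keep every row that has at least one non-NA value
df_clean <- df[!all_na, ]
```

Rows with only some NA values (rows 2 and 5 here) are kept, which is the difference from complete.cases().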

Check whether this works for the updated explanation. It subsets the data.frame so that the information columns are ignored when checking for NA. I added some rows that contain a mix of numbers and NA.
df1 <- data.frame(Date=c("2016-01-01", "2016-01-01", "2016-02-01", "2016-03-01", "2016-03-01"),
Product=c("aaa", "bbb", "ccc", "ddd", "eee"),
Code=c("0001", "0003", "0032", "0005", "0007"),
protein=c(NA, NA, NA, 5, NA),
fat=c(NA, NA, NA, NA, 4))
# place any columns you do not want to check for NA in names.info
names.info <- c("Date", "Product", "Code")
names.check <- setdiff(names(df1), names.info)
# drop = FALSE keeps a data.frame even if only one column is checked
df1[rowSums(is.na(df1[, names.check, drop = FALSE])) != length(names.check), ]
Date Product Code protein fat
4 2016-03-01 ddd 0005 5 NA
5 2016-03-01 eee 0007 NA 4
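If you need this in several places, the same idea could be wrapped in a small function. The sketch below is only an illustration: drop_rows_all_na() and its ignore argument are hypothetical names, not part of any package.

```r
# Hypothetical helper: drop rows in which every column outside `ignore` is NA
drop_rows_all_na <- function(df, ignore = character()) {
  check <- setdiff(names(df), ignore)
  # drop = FALSE keeps a data frame even when only one column is checked
  all_na <- rowSums(is.na(df[, check, drop = FALSE])) == length(check)
  df[!all_na, , drop = FALSE]
}

df1 <- data.frame(Date = c("2016-01-01", "2016-01-01", "2016-02-01",
                           "2016-03-01", "2016-03-01"),
                  Product = c("aaa", "bbb", "ccc", "ddd", "eee"),
                  Code = c("0001", "0003", "0032", "0005", "0007"),
                  protein = c(NA, NA, NA, 5, NA),
                  fat = c(NA, NA, NA, NA, 4))

res <- drop_rows_all_na(df1, ignore = c("Date", "Product", "Code"))
```

Here res should hold only the ddd and eee rows, because those are the only rows with a value in protein or fat.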

You need to remove the as.integer() call. Take this matrix:
mymat <- matrix(c(1:3, NA, 4:6, NA, rep(NA, 4)), ncol = 3)
Which translates to
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 5 NA
[3,] 3 6 NA
[4,] NA NA NA
mymat[as.integer(rowSums(is.na(mymat)) != ncol(mymat)), ]
Gives you
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 1 4 NA
[3,] 1 4 NA
But you want
mymat[rowSums(is.na(mymat)) != ncol(mymat), ]
To get
[,1] [,2] [,3]
[1,] 1 4 NA
[2,] 2 5 NA
[3,] 3 6 NA
Cheers,
Marc
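The reason is that a logical index and an integer index mean different things in R. A short sketch with the same mymat (the names keep, by_logical and by_integer are just for illustration):

```r
mymat <- matrix(c(1:3, NA, 4:6, NA, rep(NA, 4)), ncol = 3)

# Logical vector: one TRUE/FALSE per row, TRUE means "keep this row"
keep <- rowSums(is.na(mymat)) != ncol(mymat)

# Used as a logical index, it selects rows 1, 2 and 3
by_logical <- mymat[keep, ]

# as.integer() turns TRUE/FALSE into 1/0, which R reads as row numbers:
# row 1 three times, and the 0 selects nothing
by_integer <- mymat[as.integer(keep), ]
```

So by_logical contains rows 1 to 3, while by_integer contains three copies of row 1.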

Related

Cbind get environment objects in R

I would like to cbind vectors of the same length using a vector of their names.
For example, I would like to get from
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
to the matrix that cbind(a, b) produces:
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
but by looking the variables up from a vector of object names, e.g. vectornames <- c("a","b").
My last attempt, cbind(for(i in vectornames) get(i)), failed.
You want to sapply/lapply the get function here. For example:
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
nms <- c("a", "b")
# Apply get() to each name in the nmes vector
# Then convert the resulting matrix to a data frame
as.data.frame(sapply(nms, get))
a b
1 2 NA
2 5 1
3 NA 3
4 NA 4
5 6 NA
6 NA 8
Technically you can do this using cbind, but it's more awkward:
# Convert the vector of names to a list of vectors
# Then bind those vectors together as columns
do.call(cbind, lapply(nms, get))
We can use mget to 'get' the objects as a named list, then pass that list through sapply with the identity function(x) x or `[`, which simplifies it into a matrix:
sapply(mget(vectornames), \(x) x)
#OR
sapply(mget(vectornames), `[`)
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
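Since mget() returns a named list, it can also be combined with do.call(cbind, ...) so that the column names are kept. This is a minimal sketch of that combination; the explicit envir = environment() is an assumption added here so the lookup happens in the environment where a and b were defined.

```r
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
vectornames <- c("a", "b")

# mget() returns a named list: list(a = ..., b = ...)
objs <- mget(vectornames, envir = environment())

# cbind() uses the list names as column names
m <- do.call(cbind, objs)
```

By contrast, lapply(nms, get) returns an unnamed list, so binding that list leaves the columns without names.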

Setting matrix values comparing to vector in R

I want to set NA's in every element of a matrix where the value in a column is greater than or equal to the value of a given vector. For example, I can create a matrix:
set.seed(1)
zz <- matrix(data = round(10L * runif(12)), nrow = 4, ncol = 3)
which gives for zz:
[,1] [,2] [,3]
[1,] 8 5 7
[2,] 6 5 1
[3,] 5 10 3
[4,] 9 1 9
and for the comparison vector (for example):
xx <- round(10L * runif(4))
where xx is:
[1] 6 3 8 2
if I perform this operation:
apply(zz,2,function(x) x >= xx)
I get:
[,1] [,2] [,3]
[1,] TRUE FALSE TRUE
[2,] TRUE TRUE FALSE
[3,] FALSE TRUE FALSE
[4,] TRUE FALSE TRUE
What I want is an NA wherever I have a TRUE element, and the number from the zz matrix wherever I have a FALSE (e.g., manually ...):
NA 5 NA
NA NA 1
5 NA 3
NA 1 NA
I can cobble together some "for" loops to do what I want, but is there a vectorized way to do this?
Thanks for any tips.
You could simply do:
zz[zz>=xx] <- NA
# [,1] [,2] [,3]
#[1,] NA 5 NA
#[2,] NA NA 1
#[3,] 5 NA 3
#[4,] NA 1 NA
Here is one option to get the expected output. We get a logical matrix (zz >= xx), using NA^ on that returns NA for the TRUE values and 1 for the FALSE, then multiply it with original matrix 'zz' so that NA remains as such while the 1 changes to the corresponding value in 'zz'.
NA^(zz >= xx)*zz
# [,1] [,2] [,3]
#[1,] NA 5 NA
#[2,] NA NA 1
#[3,] 5 NA 3
#[4,] NA 1 NA
Or another option is ifelse
ifelse(zz >= xx, NA, zz)
data
zz <- structure(c(8, 6, 5, 9, 5, 5, 10, 1, 7, 1, 3, 9), .Dim = c(4L, 3L))
xx <- c(6, 3, 8, 2)
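All of these rely on zz >= xx comparing column by column, which works because R recycles xx down each column and length(xx) equals nrow(zz). The sketch below checks that against the apply() version from the question and confirms that the three approaches agree; mask, out1, out2 and out3 are names made up for the illustration.

```r
zz <- structure(c(8, 6, 5, 9, 5, 5, 10, 1, 7, 1, 3, 9), .Dim = c(4L, 3L))
xx <- c(6, 3, 8, 2)

# xx is recycled down each column because length(xx) == nrow(zz)
mask <- zz >= xx
same_as_apply <- all(mask == apply(zz, 2, function(x) x >= xx))

# Approach 1: assign NA through the logical mask (on a copy)
out1 <- zz
out1[mask] <- NA

# Approach 2: NA^TRUE is NA and NA^FALSE is 1, then multiply by zz
out2 <- NA^mask * zz

# Approach 3: ifelse() picks NA or the original value cell by cell
out3 <- ifelse(mask, NA, zz)
```

If xx had one entry per column rather than one per row, this recycling would line up the wrong cells; sweep(zz, 2, xx, ">=") is one way to build the mask in that case.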

Interpolate multiple NA values with R

I want to interpolate multiple NA values in a matrix called tester.
This is part of tester, showing only one column with NA values; in the whole 744x6 matrix, other columns have multiple NAs as well:
ZONEID TIMESTAMP U10 V10 U100 V100
1 20121022 12:00 -1.324032e+00 -2.017107e+00 -3.278166e+00 -5.880225574
1 20121022 13:00 -1.295168e+00 NA -3.130429e+00 -6.414975148
1 20121022 14:00 -1.285004e+00 NA -3.068829e+00 -7.101699541
1 20121022 15:00 -9.605904e-01 NA -2.332645e+00 -7.478168285
1 20121022 16:00 -6.268261e-01 -3.057278e+00 -1.440209e+00 -8.026791079
I have installed the zoo package and loaded it with library(zoo). I have tried to use the na.approx function, which interpolates on a linear basis, but it returns errors:
na.approx(tester)
# Error ----> need at least two non-NA values to interpolate
na.approx(tester, rule = 2)
# Error ----> need at least two non-NA values to interpolate
na.approx(tester, x = index(tester), na.rm = TRUE, maxgap = Inf)
Afterward I tried:
Lines <- "tester"
library(zoo)
z <- read.zoo(textConnection(Lines), index = 2)[,2]
na.approx(z)
Again I got the same multiple NA values error. I also tried:
Cz <- zoo(tester)
index(Cz) <- Cz[,1]
Cz_approx <- na.approx(Cz)
Same error.
I must be doing something really stupid, but I would really appreciate your help.
You may apply na.approx only on columns with at least two non-NA values. Here I use colSums on a boolean matrix to find relevant columns.
# create a small matrix
m <- matrix(data = c(NA, 1, 1, 1, 1,
NA, NA, 2, NA, NA,
NA, NA, NA, NA, 2,
NA, NA, NA, 2, 3),
ncol = 5, byrow = TRUE)
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 1 1 1
# [2,] NA NA 2 NA NA
# [3,] NA NA NA NA 2
# [4,] NA NA NA 2 3
library(zoo)
# na.approx on the entire matrix does not work
na.approx(m)
# Error in approx(x[!na], y[!na], xout, ...) :
# need at least two non-NA values to interpolate
# find columns with at least two non-NA values
idx <- colSums(!is.na(m)) > 1
idx
# [1] FALSE FALSE TRUE TRUE TRUE
# interpolate 'TRUE columns' only
m[ , idx] <- na.approx(m[ , idx])
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 1 1.000000 1.0
# [2,] NA NA 2 1.333333 1.5
# [3,] NA NA NA 1.666667 2.0
# [4,] NA NA NA 2.000000 3.0
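For comparison, the same column-wise guard can be sketched in base R with approx(), which is the function named in the error message. This is an illustration only, not zoo's implementation; interp_col() is a hypothetical helper.

```r
m <- matrix(c(NA, 1, 1, 1, 1,
              NA, NA, 2, NA, NA,
              NA, NA, NA, NA, 2,
              NA, NA, NA, 2, 3),
            ncol = 5, byrow = TRUE)

# Hypothetical helper: linearly interpolate one column, leaving it
# untouched when fewer than two values are known
interp_col <- function(y) {
  known <- !is.na(y)
  if (sum(known) < 2) return(y)
  # rule = 1 (the default) gives NA outside the range of the known points
  approx(x = which(known), y = y[known], xout = seq_along(y))$y
}

filled <- apply(m, 2, interp_col)
```

Columns 1 and 2 come back unchanged, and the trailing NAs in column 3 stay NA because no known value follows them.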

Consecutive NAs in a column

I'd like to remove the rows that are part of a run of 3 or more consecutive NAs in one column.
[,1] [,2]
[1,] 1 1
[2,] NA 1
[3,] 2 4
[4,] NA 3
[5,] 1 4
[6,] NA 8
[7,] NA 5
[8,] NA 6
so I'd have this data
[,1] [,2]
[1,] 1 1
[2,] NA 1
[3,] 2 4
[4,] NA 3
[5,] 1 4
I did some research and tried this code:
data[! rowSums(is.na(data)) >3 , ]
but I think this only works for NAs within a row, not for consecutive NAs down a column.
As mentioned, rle is a good place to start:
is.na.rle <- rle(is.na(data[, 1]))
Since NAs are "bad" only when they come by three or more, we can re-write the values:
is.na.rle$values <- is.na.rle$values & is.na.rle$lengths >= 3
Finally, use inverse.rle to build the vector of indices to filter:
data[!inverse.rle(is.na.rle), ]
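Put together on the data from the question, the three steps look roughly like this. The matrix is rebuilt here so the sketch is self-contained, and the names runs, drop and kept are made up for the illustration.

```r
data <- cbind(c(1, NA, 2, NA, 1, NA, NA, NA),
              c(1, 1, 4, 3, 4, 8, 5, 6))

# Run-length encode the NA pattern of the first column:
# lengths 1 1 1 1 1 3, values FALSE TRUE FALSE TRUE FALSE TRUE
runs <- rle(is.na(data[, 1]))

# Only NA runs of length 3 or more should be flagged
runs$values <- runs$values & runs$lengths >= 3

# Expand back to one TRUE/FALSE per row and drop the flagged rows
drop <- inverse.rle(runs)
kept <- data[!drop, ]
```

The isolated NAs in rows 2 and 4 survive because their runs have length 1.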
You could use rle, or you could do this:
library(data.table)
d = data.table(a = c(1,NA,2,NA,3,4,NA,NA,NA), b = c(1:9))
d[d[, if(.N > 3) {.I[1]} else {.I}, by = cumsum(!is.na(a))]$V1]
# a b
#1: 1 1
#2: NA 2
#3: 2 3
#4: NA 4
#5: 3 5
#6: 4 6
Run d[, cumsum(!is.na(a))] to see why this works. Also, I could've used .SD instead of .I to get cleaner code, but opted for efficiency instead.
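To see the grouping idea without data.table, here is a base R sketch of the same logic. The names grp, grp_size and keep are made up for the illustration, and it assumes the column does not start with a run of NAs.

```r
a <- c(1, NA, 2, NA, 3, 4, NA, NA, NA)
b <- 1:9

# Each non-NA value starts a new group; the NAs that follow it share its id
grp <- cumsum(!is.na(a))

# Size of the group each element belongs to (one non-NA value plus its NAs)
grp_size <- ave(seq_along(a), grp, FUN = length)

# A group of more than 3 elements holds at least 3 consecutive NAs:
# keep its leading non-NA value and drop the NAs
keep <- !(is.na(a) & grp_size > 3)

result <- data.frame(a, b)[keep, ]
```

The last group contains the value 4 and the three NAs after it, so only those three NA rows are dropped.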
As @DirkEddelbuettel suggested, the rle() function will help. You can create your own function to identify the elements of a vector with 3 or more consecutive NA values.
consecna <- function(x, n=3) {
# function to identify elements with n or more consecutive NA values
y <- rle(is.na(x))
y$values <- y$lengths > (n - 0.5) & y$values
inverse.rle(y)
}
Then you can apply this function to each column of your matrix.
# example matrix of data
m <- matrix(c(1, NA, 2, NA, 1, NA, NA, NA, 1, 1, 4, 3, 4, 8, 5, 6), ncol=2)
# index matrix identifying elements with 3 or more consecutive NA values
mindex <- apply(m, 2, consecna)
Then use the created index matrix to get rid of all those rows that were identified.
# removal of all the identified rows
m2 <- m[!apply(mindex, 1, any), ]

transforming dataset (similarity ratings)

I want to transform the following data format (simplified representation):
image1 image2 rating
1 1 2 6
2 1 3 5
3 1 4 7
4 2 3 3
5 2 4 5
6 3 4 1
Reproduced by:
structure(list(image1 = c(1, 1, 1, 2, 2, 3), image2 = c(2, 3,
4, 3, 4, 4), rating = c(6, 5, 7, 3, 5, 1)), .Names = c("image1",
"image2", "rating"), row.names = c(NA, -6L), class = "data.frame")
To a format resembling a correlation matrix, where the first two columns serve as row and column indices and the ratings are the values:
1 2 3 4
1 NA 6 5 7
2 6 NA 3 5
3 5 3 NA 1
4 7 5 1 NA
Does anyone know of a function in R to do this?
I would rather use matrix indexing:
N <- max(dat[c("image1", "image2")])
out <- matrix(NA, N, N)
out[cbind(dat$image1, dat$image2)] <- dat$rating
out[cbind(dat$image2, dat$image1)] <- dat$rating
# [,1] [,2] [,3] [,4]
# [1,] NA 6 5 7
# [2,] 6 NA 3 5
# [3,] 5 3 NA 1
# [4,] 7 5 1 NA
I don't like the <<- operator very much, but it works for this (naming your structure s):
N <- max(s[,1:2])
m <- matrix(NA, nrow=N, ncol=N)
apply(s, 1, function(x) { m[x[1], x[2]] <<- m[x[2], x[1]] <<- x[3]})
> m
[,1] [,2] [,3] [,4]
[1,] NA 6 5 7
[2,] 6 NA 3 5
[3,] 5 3 NA 1
[4,] 7 5 1 NA
Not as elegant as Karsten's solution, but it does not rely on the order of the rows, nor does it require that all combinations be present.
Here is one approach, where dat is the data frame as defined in the question
res <- matrix(0, nrow=4, ncol=4) # dim may need to be adjusted
ll <- lower.tri(res, diag=FALSE)
res[which(ll)] <- dat$rating
res <- res + t(res)
diag(res) <- NA
This works only if the rows are ordered as in the question.
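One way to relax that restriction is to sort the rows before filling the lower triangle, since lower.tri() walks the matrix column by column. The sketch below assumes every pair with image1 < image2 still appears exactly once; shuffled and sorted are names made up for the illustration.

```r
dat <- data.frame(image1 = c(1, 1, 1, 2, 2, 3),
                  image2 = c(2, 3, 4, 3, 4, 4),
                  rating = c(6, 5, 7, 3, 5, 1))

# Shuffle the rows to mimic data that arrives in arbitrary order
shuffled <- dat[c(6, 2, 4, 1, 5, 3), ]

# Restore the order that the lower-triangle fill relies on
sorted <- shuffled[order(shuffled$image1, shuffled$image2), ]

N <- max(sorted$image1, sorted$image2)
res <- matrix(0, nrow = N, ncol = N)
res[lower.tri(res)] <- sorted$rating  # filled column by column
res <- res + t(res)                   # mirror into the upper triangle
diag(res) <- NA
```

If some pairs are missing, the matrix indexing approach is the safer choice because it places each rating by its own indices.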
