I have a character vector in R, and want to make a new vector with multiple NAs between the elements of the character vector. To simplify, the character vector is:
cv <- c( "A", "B", "C" )
Let's say we just want 3 NAs (actually need much more). Desired output vector would be:
"A", NA, NA, NA, "B", NA, NA, NA, "C", NA, NA, NA
I'm guessing this has been asked before, but it's very difficult to search for. I've tried various permutations and combinations of rep and rbind with no success. Be gentle; my first question :-)
Use sapply to concatenate c(NA, NA, NA) to each element of cv so that for each element of cv we get a 4-vector. sapply will arrange these into a 4 x n matrix (where n is the length of cv) and c on the left will unravel that matrix into a vector.
c(sapply(cv, c, rep(NA, 3)))
## [1] "A" NA NA NA "B" NA NA NA "C" NA NA NA
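To see the intermediate step, it may help to look at the matrix before c() flattens it. This is only a sketch using the same cv; the names m and v are introduced here for illustration.

```r
cv <- c("A", "B", "C")

# sapply() returns one column per element of cv,
# each column being c(element, NA, NA, NA)
m <- sapply(cv, c, rep(NA, 3))
dim(m)   # 4 rows, 3 columns

# c() drops the dim attribute and reads the matrix column by column
v <- c(m)
v
```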
You can also combine matrix(), rbind() and as.vector():
v <- as.vector(rbind(cv,matrix(nrow = 3,ncol = length(cv))))
such that
> v
[1] "A" NA NA NA "B" NA NA NA "C" NA NA
[12] NA
We could create a vector with NA's and replace cv elements based on position generated by seq.
n <- 3
vec <- rep(NA, (n + 1) * length(cv))
vec[seq(1, length(vec), n + 1)] <- cv
vec
#[1] "A" NA NA NA "B" NA NA NA "C" NA NA NA
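Since the real case needs many more NAs, the same idea could be wrapped in a small function. The helper name pad_with_na is hypothetical (it is not from any package), and the positions are computed with seq_along so that an empty input does not fail.

```r
# Hypothetical helper: follow each element of x with n NAs
pad_with_na <- function(x, n) {
  out <- rep(NA, (n + 1) * length(x))
  # positions 1, n + 2, 2n + 3, ... hold the original elements
  out[(seq_along(x) - 1) * (n + 1) + 1] <- x
  out
}

pad_with_na(c("A", "B", "C"), 3)
```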
Beforehand
The most obvious answer to the title is that missing values are represented by NA in R. Dummy data:
x <- c("a", "NA", "<NA>", NA)
We can transform all elements of x to characters using x_paste0 <- paste0(x). After doing so, the second and fourth elements are the same ("NA"), and to my knowledge this is why there is no way to backtransform x_paste0 to x.
addNA
But working with addNA indicates that it is not just the NA itself that represents missings. In x only the last element is a missing. Let's transform the vector:
x_new <- addNA(x)
x_new
[1] a NA <NA> <NA>
Levels: <NA> a NA <NA>
Interestingly, the fourth element, i.e. the missing value, is shown as <NA> and not as NA. Further, the fourth element now looks the same as the third. And we are told that there are no missings, because any(is.na(x_new)) returns FALSE. At this point I would have thought that the information about which element is the missing one (the third or the fourth) is simply lost, as it was in x_paste0. But this is not true, because we can actually backtransform x_new. See:
as.character(x_new)
[1] "a" "NA" "<NA>" NA
How does as.character know that the third element is "<NA>" and the fourth is an actual missing, i.e. NA?
That's probably a quirk of the base:::print.factor() method.
x <- c("a", "NA", "<NA>", NA)
addNA(x)
# [1] a NA <NA> <NA>
# Levels: <NA> a NA <NA>
But:
levels(addNA(x))
# [1] "<NA>" "a" "NA" NA
So, there are no duplicated levels.
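To make that concrete, here is a small sketch of what the factor stores. The names f, codes and back are only for illustration; the point is that the missing value sits in the levels, not in the integer codes, which is why nothing is lost.

```r
x <- c("a", "NA", "<NA>", NA)
f <- addNA(x)

# the values are stored as integer codes, none of which is missing
codes <- as.integer(f)
anyNA(codes)           # FALSE

# the missing value lives in the levels instead
anyNA(levels(f))       # TRUE

# looking up each code in the levels gives back the original vector
back <- levels(f)[codes]
identical(back, x)     # TRUE
```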
Usually you try to prevent this when you read your data, either a csv or other source. A bit of a silly demo using read.table on your vector sample data.
x <- c("a", "NA", "<NA>", NA)
x <- read.table(text = x, na.strings = c("NA", "<NA>", ""), stringsAsFactors = FALSE)$V1
x
[1] "a" NA NA NA
But if you want to fix it afterwards
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")
unlist(lapply(x, function(v) { ifelse(v %in% na_strings, NA, v) }))
[1] "a" NA NA NA
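A shorter variant of the same fix, as a sketch: because %in% is vectorised, the lapply/ifelse loop can probably be replaced by a single indexed assignment.

```r
x <- c("a", "NA", "<NA>", NA)
na_strings <- c("NA", "<NA>", "")

# %in% is vectorised, so the whole vector can be fixed with one assignment
x[x %in% na_strings] <- NA
x
```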
some notes on factors and addNA
# not to be confused with character values that look like missing values but are not
x <- c("a", "b", "c", NA)
x_1 <- addNA(x)
x_1
# do not get confused on how the displayed output is
# [1] a b c <NA>
# Levels: a b c <NA>
str(x_1)
# Factor w/ 4 levels "a","b","c",NA: 1 2 3 4
is.na(x_1) # as your actual values are 1, 2, 3, 4
# [1] FALSE FALSE FALSE FALSE
is.na(levels(x_1))
# [1] FALSE FALSE FALSE TRUE
# but nothing is lost
x_2 <- as.character(x_1)
str(x_2)
# chr [1:4] "a" "b" "c" NA
is.na(x_2)
# [1] FALSE FALSE FALSE TRUE
I would like to cbind vectors of the same length using a vector of their names.
For example, I would like to get from
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
the matrix produced by cbind(a,b):
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
but referring to the variables through a vector of object names, e.g. vectornames <- c("a","b")
My last try failed on cbind(for(i in vectornames) get(i))
You want to sapply/lapply the get function here. For example:
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
nms <- c("a", "b")
# Apply get() to each name in the nmes vector
# Then convert the resulting matrix to a data frame
as.data.frame(sapply(nms, get))
a b
1 2 NA
2 5 1
3 NA 3
4 NA 4
5 6 NA
6 NA 8
Technically you can do this using cbind, but it's more awkward:
# Convert the vector of names to a list of vectors
# Then bind those vectors together as columns
do.call(cbind, lapply(nms, get))
We can use mget to 'get' a list, then "loop-unlist" with sapply and function(x) x or [ to create a matrix
sapply(mget(vectornames), \(x) x)
#OR
sapply(mget(vectornames), `[`)
a b
[1,] 2 NA
[2,] 5 1
[3,] NA 3
[4,] NA 4
[5,] 6 NA
[6,] NA 8
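If cbind itself is preferred, mget can feed it directly. A minimal sketch, assuming the vectors live in the environment the code is run from; because mget returns a named list, the column names are kept.

```r
a <- c(2, 5, NA, NA, 6, NA)
b <- c(NA, 1, 3, 4, NA, 8)
vectornames <- c("a", "b")

# mget() returns a named list, so cbind() picks up the column names
m <- do.call(cbind, mget(vectornames))
m
```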
I am just starting out on learning R and came across a piece of code as follows
vec_1 <- c("a","b", NA, "c","d")
# create a subset of all elements which equal "a"
vec_1[vec_1 == "a"]
The result from this is
## [1] "a" NA
I'm just curious: since I am subsetting vec_1 for the value "a", why does NA also show up in my results?
This is because the result of anything == NA is NA. Even NA == NA is NA.
Here's the output of vec_1 == "a" -
[1] TRUE FALSE NA FALSE FALSE
and NA is not TRUE or FALSE so when you subset anything by NA you get NA. Check this out -
vec_1[NA]
[1] NA NA NA NA NA
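The reason this returns five NAs rather than one is that a bare NA is a logical value and gets recycled to the length of vec_1. A short sketch of the difference:

```r
vec_1 <- c("a", "b", NA, "c", "d")

# a bare NA is logical, so it is recycled to the length of vec_1
length(vec_1[NA])            # 5

# an integer NA is a single (unknown) position
length(vec_1[NA_integer_])   # 1
```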
When dealing with NA, R tries to provide the most informative answer, e.g. TRUE | NA returns TRUE because it doesn't matter what NA is. Here are some more examples -
T | NA
[1] TRUE
F | NA
[1] NA
T & NA
[1] NA
F & NA
[1] FALSE
== cannot test for NA, because any comparison with NA is itself NA (is.na() is the function for that). In your case you can use the %in% operator, which never returns NA -
5 %in% NA
[1] FALSE
"a" %in% NA
[1] FALSE
vec_1[vec_1 %in% "a"]
[1] "a"
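Two other ways to get the same result, shown only as a sketch: which() keeps the TRUE positions and silently drops NA, and an explicit is.na() check turns the NA into FALSE. The names r1 and r2 are just for illustration.

```r
vec_1 <- c("a", "b", NA, "c", "d")

# which() keeps only the positions that are TRUE and drops NA
r1 <- vec_1[which(vec_1 == "a")]

# or rule out the missing values explicitly
r2 <- vec_1[!is.na(vec_1) & vec_1 == "a"]

r1
r2
```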
I want to interpolate multiple NA values in a matrix called tester.
This is a part of tester with only one column containing NA values; in the whole 744x6 matrix other columns have multiple NAs as well:
ZONEID TIMESTAMP U10 V10 U100 V100
1 20121022 12:00 -1.324032e+00 -2.017107e+00 -3.278166e+00 -5.880225574
1 20121022 13:00 -1.295168e+00 NA -3.130429e+00 -6.414975148
1 20121022 14:00 -1.285004e+00 NA -3.068829e+00 -7.101699541
1 20121022 15:00 -9.605904e-01 NA -2.332645e+00 -7.478168285
1 20121022 16:00 -6.268261e-01 -3.057278e+00 -1.440209e+00 -8.026791079
I have installed the zoo package and loaded it with library(zoo). I have tried to use the na.approx function, which interpolates linearly, but it returns an error:
na.approx(tester)
# Error ----> need at least two non-NA values to interpolate
na.approx(tester, rule = 2)
# Error ----> need at least two non-NA values to interpolate
na.approx(tester, x = index(tester), na.rm = TRUE, maxgap = Inf)
Afterward I tried:
Lines <- "tester"
library(zoo)
z <- read.zoo(textConnection(Lines), index = 2)[,2]
na.approx(z)
Again I got the same multiple NA values error. I also tried:
Cz <- zoo(tester)
index(Cz) <- Cz[,1]
Cz_approx <- na.approx(Cz)
Same error.
I must be doing something really stupid, but I would really appreciate your help.
You may apply na.approx only to columns with at least two non-NA values. Here I use colSums on a logical matrix to find the relevant columns.
# create a small matrix
m <- matrix(data = c(NA, 1, 1, 1, 1,
NA, NA, 2, NA, NA,
NA, NA, NA, NA, 2,
NA, NA, NA, 2, 3),
ncol = 5, byrow = TRUE)
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 1 1 1
# [2,] NA NA 2 NA NA
# [3,] NA NA NA NA 2
# [4,] NA NA NA 2 3
library(zoo)
# na.approx on the entire matrix does not work
na.approx(m)
# Error in approx(x[!na], y[!na], xout, ...) :
# need at least two non-NA values to interpolate
# find columns with at least two non-NA values
idx <- colSums(!is.na(m)) > 1
idx
# [1] FALSE FALSE TRUE TRUE TRUE
# interpolate 'TRUE columns' only
m[ , idx] <- na.approx(m[ , idx])
m
# [,1] [,2] [,3] [,4] [,5]
# [1,] NA 1 1 1.000000 1.0
# [2,] NA NA 2 1.333333 1.5
# [3,] NA NA NA 1.666667 2.0
# [4,] NA NA NA 2.000000 3.0
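For comparison, the same "skip columns with fewer than two values" rule can be sketched without zoo, using approx() from base R's stats package. This swaps in a different function than the answer uses; interp_col is a hypothetical helper name, and values outside the observed range are left as NA (the default rule = 1).

```r
m <- matrix(c(NA,  1,  1,  1,  1,
              NA, NA,  2, NA, NA,
              NA, NA, NA, NA,  2,
              NA, NA, NA,  2,  3),
            ncol = 5, byrow = TRUE)

# hypothetical helper: linear interpolation of one column with stats::approx()
interp_col <- function(col) {
  ok <- !is.na(col)
  if (sum(ok) < 2) return(col)   # too few points, leave the column alone
  approx(which(ok), col[ok], xout = seq_along(col))$y
}

# apply the helper column by column; the result is again a 4 x 5 matrix
res <- apply(m, 2, interp_col)
res
```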
I'm using a data.frame:
data.frame("A"=c(NA,5,NA,NA,NA),
"B"=c(1,2,3,4,NA),
"C"=c(NA,NA,NA,2,3),
"D"=c(NA,NA,NA,7,NA))
This delivers a data.frame in this form:
A B C D
1 NA 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 4 2 7
5 NA NA 3 NA
My aim is to check each row of the data.frame, if there is a value greater than a specific one (let's assume 2) and to get the name of the columns where this is the case.
The desired output (value greater 2) should be:
for row 1 of the data.frame
x[1,]: c()
for row 2
x[2,]: c("A")
for row3
x[3,]: c("B")
for row4
x[4,]: c("B","D")
and for row5 of the data.frame
x[5,]: c("C")
Thanks for your help!
You can use which:
lapply(apply(dat, 1, function(x)which(x>2)), names)
with dat being your data frame.
[[1]]
character(0)
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "B" "D"
[[5]]
[1] "C"
EDIT
Shorter version suggested by flodel:
lapply(apply(dat > 2, 1, which), names)
Edit: (from Arun)
First, there's no need for lapply and apply. You can get the same just with apply:
apply(dat > 2, 1, function(x) names(which(x)))
But, using apply on a data.frame will coerce it into a matrix, which may not be wise if the data.frame is huge.
To answer @flodel's concerns, I'll write it as a separate answer:
1) Using lapply gets a list and apply doesn't guarantee this always:
A fair point. I'll illustrate the issue with an example:
df <- structure(list(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA)), .Names = c("A",
"B", "C", "D"), row.names = c(NA, -5L), class = "data.frame")
A B C D
1 3 1 NA NA
2 5 2 NA NA
3 NA 3 NA NA
4 NA 1 2 7
5 NA NA 3 NA
# using `apply` results in a vector:
apply(df, 1, function(x) names(which(x>2)))
# [1] "A" "A" "B" "D" "C"
So, how can we guarantee a list with apply?
By creating a list inside the function and then calling unlist with recursive = FALSE, as shown below:
unlist(apply(df, 1, function(x) list(names(which(x>2)))), recursive=FALSE)
[[1]]
[1] "A"
[[2]]
[1] "A"
[[3]]
[1] "B"
[[4]]
[1] "D"
[[5]]
[1] "C"
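Another way to always get a list, sketched here as an alternative rather than as part of the original answer, is to skip apply entirely: which() with arr.ind = TRUE gives the row and column of every hit, and split() groups the column names by row. The names hits, rows and res are only for illustration.

```r
df <- data.frame(A = c(3, 5, NA, NA, NA), B = c(1, 2, 3, 1, NA),
                 C = c(NA, NA, NA, 2, 3), D = c(NA, NA, NA, 7, NA))

# row and column positions of every value greater than 2 (NAs are skipped)
hits <- which(df > 2, arr.ind = TRUE)

# group the column names by row; the factor keeps rows without any hit
rows <- factor(hits[, "row"], levels = seq_len(nrow(df)))
res <- split(names(df)[hits[, "col"]], rows)
res
```

The result is a list named by row number, and a row with no value above the threshold would show up as character(0) instead of being dropped.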
2) lapply is overall shorter, and does not require anonymous function:
Yes, but it's slower. Let me illustrate this on a big example.
set.seed(45)
df <- as.data.frame(matrix(sample(c(1:10, NA), 1e5 * 100, replace=TRUE),
ncol = 100))
system.time(t1 <- lapply(apply(df > 2, 1, which), names))
user system elapsed
5.025 0.342 5.651
system.time(t2 <- unlist(apply(df, 1, function(x)
list(names(which(x>2)))), recursive=FALSE))
user system elapsed
2.860 0.181 3.065
identical(t1, t2) # TRUE
3) All answers are wrong and the answer that'll work with all inputs:
lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)])
First, I don't see what is wrong. If you mean that the list is unnamed, that can be changed by setting the names once at the end.
Second, unfortunately, using split on a huge data.frame produces a very large number of split elements and is terribly slow (due to the huge number of factor levels).
# testing on huge data.frame
system.time(t3 <- lapply(split(df, rownames(df)), function(x)names(x)[which(x > 2)]))
user system elapsed
517.545 0.312 517.872
Third, this orders the elements as 1, 10, 100, 1000, 10000, 100000, ... instead of 1 .. 1e5. Instead, one could use setNames, or setattr from the data.table package, to set the names just once at the end, as shown below:
# setting names just once
t2 <- setNames(t2, rownames(df)) # by copy
# or even better using `data.table` `setattr` function to
# set names by reference
require(data.table)
tracemem(t2)
setattr(t2, 'names', rownames(df))
tracemem(t2)
Comparing the output doesn't show any other difference between the two (t3 and t2). You could run this to verify that the outputs are the same (time consuming):
all(sapply(names(t2), function(x) all(t2[[x]] == t3[[x]])) == TRUE) # TRUE
why not do
colnames(df)[which(df[i, ] > 2)]
for each row, where df is your data frame and i is the row number ;) (which() drops the NA comparisons, which would otherwise break the column index)