I have for example vectors like the following:
a <- c(1, NA, NA, 2, 3)
b <- c(NA, 1, NA, NA, NA)
c <- c(NA, NA, 5, NA, NA)
I wish to merge the three vectors to get
d=c(1,1,5,2,3)
Is there a way of doing this without extensive looping? Many thanks :)
You could try
rowSums(cbind(a,b,c), na.rm=TRUE)
#[1] 1 1 5 2 3
or
mat <- cbind(a,b,c)
mat[cbind(1:nrow(mat),max.col(!is.na(mat)))]
#[1] 1 1 5 2 3
Or
ind <- which(!is.na(mat), arr.ind=TRUE)
mat[ind[order(ind[,1]),]]
#[1] 1 1 5 2 3
I would consider pmin or pmax for a more direct approach given the conditions you describe:
pmin(a, b, c, na.rm = TRUE)
# [1] 1 1 5 2 3
pmax(a, b, c, na.rm = TRUE)
# [1] 1 1 5 2 3
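If more than one vector could hold a value at the same position, pmin and pmax would pick the smallest or largest one rather than the first. A minimal base R sketch of a "first non-NA wins" merge follows; first_non_na is a hypothetical helper name introduced here, not part of any package.

```r
a <- c(1, NA, NA, 2, 3)
b <- c(NA, 1, NA, NA, NA)
c <- c(NA, NA, 5, NA, NA)

# Hypothetical helper: keep x where it is present, otherwise fall back to y.
first_non_na <- function(x, y) ifelse(is.na(x), y, x)

# Fold the helper over the vectors from left to right.
d <- Reduce(first_non_na, list(a, b, c))
d
# [1] 1 1 5 2 3
```

Positions that are NA in every vector stay NA, whereas the rowSums approach would return 0 for them.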
Related
There are multiple ways to fill missing values in R. However, I can't find a solution for filling only the first n NAs after each observed value.
Available options:
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
library(zoo)
na.locf(na_vector)
# Outputs: [1] 1 1 1 1 2 3 3 3
na.locf0(na_vector, maxgap = 2)
# Outputs: [1] 1 NA NA NA 2 3 3 3
How I would like it to be:
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
fill_na <- function(vector, n){
...
}
fill_na(na_vector, n = 1)
# Outputs: [1] 1 1 NA NA 2 3 3 NA
fill_na(na_vector, n = 2)
# Outputs: [1] 1 1 1 NA 2 3 3 3
Here is an option to get those outputs using dplyr and recursion:
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
fill_na <- function(vector, n){
  if (n == 0) {
    vector
  } else {
    fill_na(
      vector = dplyr::coalesce(vector, dplyr::lag(vector)),
      n = n - 1
    )
  }
}
fill_na(na_vector, n = 1)
# [1] 1 1 NA NA 2 3 3 NA
fill_na(na_vector, n = 2)
# [1] 1 1 1 NA 2 3 3 3
Number the NAs within each consecutive run of NAs, storing the result in a, and then fill in only those whose number is less than or equal to n. This uses only vector operations internally, with no explicit iteration or recursion.
library(collapse)
library(zoo)
fill_na <- function(x, n) {
  a <- ave(x, groupid(is.na(x)), FUN = seq_along)
  ifelse(a <= n, na.locf0(x), x)
}
fill_na(na_vector, 1)
## [1] 1 1 NA NA 2 3 3 NA
fill_na(na_vector, 2)
## [1] 1 1 1 NA 2 3 3 3
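The same run-numbering idea can also be sketched in base R without extra packages. This is only an illustration: fill_na_base and its internal names are introduced here, and it assumes a plain atomic vector as input.

```r
# Hypothetical base R variant: fill at most n NAs after each observed value.
fill_na_base <- function(x, n) {
  grp     <- cumsum(!is.na(x))                             # 0 before the first observed value
  dist    <- ave(seq_along(x), grp, FUN = seq_along) - 1   # distance from the last observed value
  carried <- c(NA, x[!is.na(x)])[grp + 1]                  # last observation carried forward
  fill    <- is.na(x) & dist <= n                          # only NAs within n steps are filled
  x[fill] <- carried[fill]
  x
}

na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
fill_na_base(na_vector, n = 1)
# [1]  1  1 NA NA  2  3  3 NA
fill_na_base(na_vector, n = 2)
# [1]  1  1  1 NA  2  3  3  3
```

Leading NAs have nothing to carry forward, so they are left as NA.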
Here is a solution to impute everything except the last n NA's based on base R + imputeTS.
library(imputeTS)
na_vector <- c(1, NA, NA, NA, 2, 3, NA, NA)
# The function that allows imputing everything except the last n NAs
fill_except_last_n_na <- function(x, n) {
  index <- which(rev(cumsum(rev(as.numeric(is.na(x))))) == n+1)
  x[1:tail(index,1)] <- na_locf(x[1:tail(index,1)])
  return(x)
}
# Call the new function
fill_except_last_n_na(na_vector,2)
## Result
[1] 1 1 1 1 2 3 NA NA
If you want an imputation method other than last observation carried forward, replace na_locf with na_ma (moving average), na_interpolation (interpolation), na_kalman (Kalman smoothing on state space models), or another imputation function from the imputeTS package (see the imputeTS documentation for a list of functions).
Suppose I have this example dataset df with the following variables.
dx_order1<-c(1, 1, NA, 1, 1)
dx_order2<-c(2, 2, 2, 2, NA)
Suppose that these variables are numeric.
I want to recode the variables. For the dx_order1 variable, I want to recode 1 as 1 and 0 otherwise. Similarly, for the dx_order2 variable, I want to recode 2 as 1 and 0 otherwise. Say that the new variables are called diag_order1 and diag_order2.
I know how to do this one by one in a manual fashion. The code below will do the job:
df$diag_order1 <- ifelse(is.na(df$dx_order1), 0, 1)
df$diag_order2 <- ifelse(is.na(df$dx_order2), 0, 1)
I was wondering how I can achieve the same outcome with a for loop. If I have a lot of similar variables, this type of manual coding is not practical. Any advice on how to use a loop to speed up the process would be appreciated.
You don't need a loop in this instance; you can convert NA to 0 using is.na. For example:
Data
df <- data.frame(dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
df[!is.na(df)] <- 1
df[is.na(df)] <- 0
Or if you have more columns with NA but only want to apply to certain columns then you could do it by specifying those columns:
df2 <- data.frame(letter_col = c(NA, letters[1:4]),
dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
# any columns starting with dx
cols <- names(df2)[grepl("^dx", names(df2))]
df2[, cols][!is.na(df2[, cols])] <- 1
df2[, cols][is.na(df2[, cols])] <- 0
You can use across with mutate in dplyr like this
library(dplyr)
library(stringr)  # provides str_extract(), used below
df2 <- data.frame(letter_col = c(NA, letters[1:4]),
dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
> df2
  letter_col dx_order1 dx_order2
1       <NA>         1         2
2          a         1         2
3          b        NA         2
4          c         1         2
5          d         1        NA
df2 %>%
  mutate(across(starts_with("dx"),
                ~ case_when(. == as.numeric(str_extract(cur_column(), "\\d$")) ~ 1,
                            is.na(.) ~ 0,
                            TRUE ~ 0),
                .names = "diag_{.col}"))
  letter_col dx_order1 dx_order2 diag_dx_order1 diag_dx_order2
1       <NA>         1         2              1              1
2          a         1         2              1              1
3          b        NA         2              0              1
4          c         1         2              1              1
5          d         1        NA              1              0
This assumes that each dx column can contain its numeric suffix, NA, or other values, as described in your question; every value other than the suffix is recoded to 0.
You can coerce the logical vector from is.na to integer. is.na works with the dataframe.
df <- data.frame(dx_order1 = c(1,1, NA, 1, 1),
dx_order2 = c(2, 2, 2, 2, NA))
df[] <- +!is.na(df)
df
# dx_order1 dx_order2
#1 1 1
#2 1 1
#3 0 1
#4 1 1
#5 1 0
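Since the question asks for a loop, here is a minimal base R sketch of one. It assumes every column to recode is named dx_order followed by the number that should map to 1; dx_cols, target, and new_name are illustrative names introduced here.

```r
df <- data.frame(dx_order1 = c(1, 1, NA, 1, 1),
                 dx_order2 = c(2, 2, 2, 2, NA))

# Columns to recode: assumed to follow the pattern dx_order<number>.
dx_cols <- grep("^dx_order", names(df), value = TRUE)

for (col in dx_cols) {
  target   <- as.numeric(sub("^dx_order", "", col))  # numeric suffix of the column name
  new_name <- sub("^dx_", "diag_", col)              # dx_order1 -> diag_order1
  # %in% returns FALSE for NA, so NA and any other value both become 0.
  df[[new_name]] <- as.integer(df[[col]] %in% target)
}

df$diag_order1
# [1] 1 1 0 1 1
df$diag_order2
# [1] 1 1 1 1 0
```

Unlike the is.na based recoding, this also maps non-missing values that differ from the suffix to 0.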
I have a data.table which was formed by taking the differences between two panel observations using:
tab <- tab[,
lapply(.SD, function(x) x - shift(x)),
by = A,
.SDcols = (sapply(tab, is.numeric))
]
library(data.table)
tab = data.table(A = c(1, 1, 2, 2), B = c(NA, 2, NA, 1), C = c(NA, NA, NA, 2), D=c(NA, 3, NA, 2))
tab
A B C D
1: 1 NA NA NA
2: 1 2 NA 3
3: 2 NA NA NA
4: 2 1 2 2
I would like to use this answer:
tab <- tab [!Reduce(`&`, lapply(tab , is.na))]
to remove rows 1 and 3, but this does not work because the first column is not NA. How can I adapt the code to solve this?
Desired outcome:
A B C D
1: 1 2 NA 3
2: 2 1 2 2
Keep the rows in which at least one column other than A is not NA:
tab[tab[, rowSums(!is.na(.SD)) > 0, .SDcols = -1]]
In this case we can specify the columns in .SDcols
tab[tab [, !Reduce(`&`, lapply(.SD , is.na)), .SDcols = 2:ncol(tab)]]
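For comparison, the same filter can be sketched on a plain data.frame in base R. This is only an illustration under the assumption that A is the single identifier column; value_cols and keep are names introduced here.

```r
tab <- data.frame(A = c(1, 1, 2, 2),
                  B = c(NA, 2, NA, 1),
                  C = c(NA, NA, NA, 2),
                  D = c(NA, 3, NA, 2))

# Every column except the identifier column A.
value_cols <- setdiff(names(tab), "A")

# TRUE for rows that have at least one non-NA value outside A.
keep <- rowSums(!is.na(tab[value_cols])) > 0

result <- tab[keep, ]
result
```

Rows 1 and 3 are dropped because B, C, and D are all NA there, even though A is not.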
Imagine a small data set like the one below, composed of three variables:
v1 <- c(0, 1, NA, 1, NA, 0)
v2 <- c(0, 0, NA, 1, NA, NA)
v3 <- c(1, NA, 0, 0, NA, 0)
df <- data.frame(v1, v2, v3)
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
One can use the is.na command as follows to calculate the number of rows with at least one missing value - and R would return 4:
sum(is.na(df$v1) | is.na(df$v2) | is.na(df$v3))
Or the number of rows with all three values missing - and R would return 1:
sum(is.na(df$v1) & is.na(df$v2) & is.na(df$v3))
Two questions at this point:
(1) How can I calculate the number of rows where "exactly one" or "exactly two" values are missing?
(2) If I am to do the above in a large data set, how can I limit the scope of the calculation to v1, v2 and v3 (that is, without having to create a subset)?
I tried variations of is.na, nrow and df, but could not get any of them to work.
Thanks!
We can use rowSums on the logical matrix (is.na(df)) and check whether the number of NAs in each row is equal to the value of interest.
n1 <- 1
sum(rowSums(is.na(df))==n1)
To make it easier, create a function to do this
f1 <- function(dat, n){
  sum(rowSums(is.na(dat)) == n)
}
f1(df, 0)
#[1] 2
f1(df, 1)
#[1] 2
f1(df, 3)
#[1] 1
f1(df, 2)
#[1] 1
NOTE: rowSums is very fast, but if it is a large dataset, then creating a logical matrix can also create problems in memory. So, we can use Reduce after looping through the columns of the dataset (lapply(df, is.na)).
sum(Reduce(`+`, lapply(df, is.na))==1)
#[1] 2
f2 <- function(dat, n){
  sum(Reduce(`+`, lapply(dat, is.na))==n)
}
f2(df, 1)
Try this:
num.rows.with.x.NA <- function(df, x, cols=names(df)) {
  return(sum(apply(df, 1, function(y) sum(is.na(y[cols])) == x)))
}
df
v1 v2 v3
1 0 0 1
2 1 0 NA
3 NA NA 0
4 1 1 0
5 NA NA NA
6 0 NA 0
num.rows.with.x.NA(df, 0, names(df))
#[1] 2
num.rows.with.x.NA(df, 1, names(df))
#[1] 2
num.rows.with.x.NA(df, 2, names(df))
#[1] 1
num.rows.with.x.NA(df, 3, names(df))
#[1] 1
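For the second part of the question (limiting the count to v1, v2 and v3 inside a larger data set), one possible sketch is to index the columns by name and tabulate all counts at once. The extra column other is made up here purely to show that it is ignored; cols, na_per_row, and counts are illustrative names.

```r
df <- data.frame(v1 = c(0, 1, NA, 1, NA, 0),
                 v2 = c(0, 0, NA, 1, NA, NA),
                 v3 = c(1, NA, 0, 0, NA, 0),
                 other = NA)  # hypothetical unrelated column

cols <- c("v1", "v2", "v3")

# Number of NAs per row, looking only at the selected columns.
na_per_row <- rowSums(is.na(df[cols]))

# Rows with exactly 0, 1, 2, or 3 missing values among the selected columns.
counts <- table(factor(na_per_row, levels = 0:length(cols)))
counts
# 0 1 2 3 
# 2 2 1 1 
```

Using factor with explicit levels keeps a zero count in the table for any number of NAs that never occurs.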
I was wondering if anyone had a quick and dirty solution to the following problem. I have a matrix that has rows of NAs, and I would like to replace each row of NAs with the previous row (if it is not also a row of NAs).
Assume that the first row is not a row of NAs
Thanks!
Adapted from an answer to this question: Idiomatic way to copy cell values "down" in an R vector
f <- function(x) {
  idx <- !apply(is.na(x), 1, all)
  x[idx,][cumsum(idx),]
}
x <- data.frame(a=c(1, 2, NA, 3, NA, NA), b=c(4, 5, NA, 6, NA, 7))
> x
a b
1 1 4
2 2 5
3 NA NA
4 3 6
5 NA NA
6 NA 7
> f(x)
a b
1 1 4
2 2 5
2.1 2 5
4 3 6
4.1 3 6
6 NA 7
Trying to think of times you may have two all NA rows in a row.
#create a data set like you discuss (in the future please do this yourself)
set.seed(14)
x <- matrix(rnorm(10), nrow=2)
y <- rep(NA, 5)
v <- do.call(rbind.data.frame, sample(list(x, x, y), 10, TRUE))
One approach:
NArows <- which(apply(v, 1, function(x) all(is.na(x)))) #find all NAs
notNA <- which(!seq_len(nrow(v)) %in% NArows) #find non NA rows
rep.row <- sapply(NArows, function(x) tail(notNA[x > notNA], 1)) #replacement rows
v[NArows, ] <- v[rep.row, ] #assign
v #view
This would not work if your first row is all NAs.
You can always use a loop, here assuming that the first row is not NA, as indicated:
fill = data.frame(x=c(1,NA,3,4,5))
for (i in 2:nrow(fill)){
  if(is.na(fill[i,1])){ fill[i,1] = fill[(i-1),1]}
}
If m is your matrix, this is your quick and dirty solution:
sapply(2:nrow(m),function(i){ if(is.na(m[i,1])) {m[i,] <<- m[(i-1),] } })
Note it uses the ugly (and dangerous) <<- operator.
Matthew's example:
x <- data.frame(a=c(1, 2, NA, 3, NA, NA), b=c(4, 5, NA, 6, NA, 7))
na.rows <- which( apply( x , 1, function(z) (all(is.na(z)) ) ) )
x[na.rows , ] <- x[na.rows-1, ]
x
#---
a b
1 1 4
2 2 5
3 2 5
4 3 6
5 3 6
6 NA 7
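Obviously a first row with all NA's would present problems, and so would two consecutive all-NA rows, because the second would be filled from a row that was itself all NA.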
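A variant that copes with several all-NA rows in a row can be sketched for a plain matrix as follows. It relies on the question's assumption that the first row is not all NA; fill_na_rows and its internal names are introduced here for illustration.

```r
# Hypothetical helper: replace each all-NA row with the nearest earlier row
# that is not all NA. Assumes the first row is not all NA.
fill_na_rows <- function(m) {
  all_na <- rowSums(!is.na(m)) == 0   # TRUE for rows made up entirely of NA
  src    <- cumsum(!all_na)           # how many usable rows have been seen so far
  m[which(!all_na)[src], , drop = FALSE]
}

m <- rbind(c(1, 4),
           c(2, 5),
           c(NA, NA),
           c(NA, NA),
           c(3, 6),
           c(NA, 7))

filled <- fill_na_rows(m)
filled
```

Rows 3 and 4 both become a copy of row 2, while the partially missing last row is left untouched.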
Here is a straightforward and conceptually perhaps the simplest one-liner:
x <- data.frame(a=c(1, 2, NA, 3, NA, NA), b=c(4, 5, NA, 6, NA, 7))
a b
1 1 4
2 2 5
3 NA NA
4 3 6
5 NA NA
6 NA 7
x1<-t(sapply(1:nrow(x),function(y) ifelse(is.na(x[y,]),x[y-1,],x[y,])))
[,1] [,2]
[1,] 1 4
[2,] 2 5
[3,] 2 5
[4,] 3 6
[5,] 3 6
[6,] NA 7
To put the column names back, just use colnames(x1)<-colnames(x)