Mutating a data frame with a search by row in R

I created this data frame as an illustration of a larger problem.
> df <- data.frame(x=c(NA, 12, NA, 67), y=c(32, NA, NA, NA), z=c(NA, NA, NA, NA))
> df
   x  y  z
1 NA 32 NA
2 12 NA NA
3 NA NA NA
4 67 NA NA
I want it to look like this.
   x
1 32
2 12
3 NA
4 67
This essentially searches each row for a number: if one is found, return it for that row; if no number is found, return NA.
I created an empty vector.
> list <- c()
Then I wrote a for loop that goes through each row, returns the element that is not NA, and appends it to the 'list' vector.
> for (i in 1:4) {list <- c(list, df[i,!is.na(df[i,])])}
> list
[[1]]
[1] 32
[[2]]
[1] 12
[[3]]
[1] 67
> unlist(list)
32 12 67
This gets close, but the NA rows are ignored.
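A minimal patch of the loop would keep the NA rows (a sketch; res is just a fresh name to avoid masking base::list):
res <- c()
for (i in 1:4) {
  v <- unlist(df[i, !is.na(df[i, ])])         # non-NA values in row i (may be empty)
  res <- c(res, if (length(v)) v[1] else NA)  # emit NA when the row has none
}
res
# [1] 32 12 NA 67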
I also tried a grep pattern match, but as you can imagine, the grep family of functions is designed to run over vectors, not data frame rows.
Not sure how to move forward. Again, the desired result:
   x
1 32
2 12
3 NA
4 67

Use apply to check for values in each row:
apply(df, 1, function(x) { z <- x[!is.na(x)]; if(length(z)) z else NA})
# [1] 32 12 NA 67
Another strategy is to use rowSums, but this solution only works if there are no 0 values in your data.frame (if there are, this method will replace those results with NA):
x <- rowSums(df, na.rm = TRUE); x[x == 0] <- NA; x
# [1] 32 12 NA 67
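To see that caveat concretely, here is a made-up frame containing a genuine 0:
df0 <- data.frame(x = c(0, NA), y = c(NA, NA))  # hypothetical example with a real 0
x <- rowSums(df0, na.rm = TRUE); x[x == 0] <- NA; x
# [1] NA NA
The real 0 in row 1 is wrongly turned into NA.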

You could use the Reduce function to combine columns pair by pair:
Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, df)
# [1] 32 12 NA 67
This function should work with non-numeric data, handles rows with multiple non-NA elements gracefully (it takes the rightmost), and should be a good deal more efficient than one relying on apply.
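For instance, on a row where two columns hold values, the later column wins (a quick illustrative check):
df2 <- data.frame(x = c(1, NA), y = c(2, NA))  # row 1 has values in both columns
Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, df2)
# [1]  2 NA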
df.big <- df[rep(1:4, 1000),]
library(microbenchmark)
microbenchmark(
  apply(df.big, 1, function(x) { z <- x[!is.na(x)]; if(length(z)) z else NA}),
  {x <- rowSums(df.big, na.rm = TRUE); x[x == 0] <- NA; x},
  Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, df.big)
)
# Unit: microseconds
#                                                                           expr       min         lq       mean     median        uq       max neval
#  apply(df.big, 1, function(x) { z <- x[!is.na(x)]; if (length(z)) z else NA }) 14550.050 15322.4825 19124.8814 17008.2935 22037.387 43337.893   100
#                     { x <- rowSums(df.big, na.rm = TRUE); x[x == 0] <- NA; x }   239.826   257.2215   389.4275   380.6595   424.593  1585.234   100
#             Reduce(function(x, y) { x[!is.na(y)] <- y[!is.na(y)]; x }, df.big)   353.326   384.4750   457.9714   436.2400   511.085   799.992   100
Basically, the approach is about as efficient as the rowSums one proposed by @Thomas, but it can also handle character and other data.
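A quick check of the character case (made-up data):
dfc <- data.frame(x = c(NA, "a"), y = c("b", NA), stringsAsFactors = FALSE)
Reduce(function(x, y) {x[!is.na(y)] <- y[!is.na(y)] ; x}, dfc)
# [1] "b" "a"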

Related

Convert nested list with different names to data.frame filling NA and adding column

I need a base R solution to convert nested list with different names to a data.frame
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z=list('k')))
convert(mylist)
## returns a data.frame:
##
##   a  b         z
##   1  2    <NULL>
##   3 NA    <NULL>
##  NA  5    <NULL>
##   9 NA <chr [1]>
I know this could easily be done with dplyr::bind_rows or data.table::rbindlist with fill = TRUE (though that is not ideal, since it fills the character column with NULL rather than NA), but I really do need a base R solution. To simplify the problem, a solution for a 2-level nested list with no 3rd-level lists is also fine, such as
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))
convert(mylist)
## returns a data.frame:
##
##   a  b  z
##   1  2 NA
##   3 NA NA
##  NA  5 NA
##   9 NA  k
I have tried something like
convert <- function(L) as.data.frame(do.call(rbind, L))
This neither fills in the NAs nor adds the extra column z.
Update
mylist here is just a simplified example. In reality I cannot assume the names of the sublist elements (a, b, and z in the example) or the sublist lengths (2, 1, 1, and 2 in the example).
Here are the assumptions for expected data.frame and the input mylist:
The column number of the expected data.frame is determined by the maximum length of the sublists, which could vary from 1 to several hundred. There is no explicit source of information about the length of each sublist (which names will appear or disappear in which sublist is unknown).
max(sapply(mylist, length)) <= 1000 ## ==> TRUE
The row number of the expected data.frame is determined by the length of mylist, which could vary from 1 to several thousand.
dplyr::between(length(mylist), 0, 10000) ## ==> TRUE
There is no explicit information about the names of the sublist elements or their order, so the column names and order of the expected data.frame can only be determined intrinsically from mylist.
Each sublist contains elements in types of numeric, character or list. To simplify the problem, consider only numeric and character.
A shorter solution in base R would be
make_df <- function(a = NA, b = NA, z = NA) {
  data.frame(a = unlist(a), b = unlist(b), z = unlist(z))
}
do.call(rbind, lapply(mylist, function(x) do.call(make_df, x)))
#>    a  b    z
#> 1  1  2 <NA>
#> 2  3 NA <NA>
#> 3 NA  5 <NA>
#> 4  9 NA    k
Update
A more general solution using the same method, but which does not require specific names would be:
build_data_frame <- function(obj) {
  nms <- unique(unlist(lapply(obj, names)))
  frmls <- as.list(setNames(rep(NA, length(nms)), nms))
  dflst <- setNames(lapply(nms, function(x) call("unlist", as.symbol(x))), nms)
  make_df <- as.function(c(frmls, call("do.call", "data.frame", dflst)))
  do.call(rbind, lapply(obj, function(x) do.call(make_df, x)))
}
This allows
build_data_frame(mylist)
#>    a  b    z
#> 1  1  2 <NA>
#> 2  3 NA <NA>
#> 3 NA  5 <NA>
#> 4  9 NA    k
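The trick here is that as.function() accepts a list whose leading elements become the formals and whose last element becomes the body; a minimal standalone illustration:
f <- as.function(c(alist(a = NA, b = NA), quote(a + b)))  # formals a and b, body a + b
f(1, 2)
# [1] 3
f(1)  # b falls back to its NA default
# [1] NA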
We can try the base R code below
subset(
  Reduce(
    function(...) {
      merge(..., all = TRUE)
    },
    Map(
      function(k, x) cbind(id = k, list2DF(x)),
      seq_along(mylist), mylist
    )
  ),
  select = -id
)
which gives
   a  b  z
1  1  2 NA
2  3 NA NA
3 NA  5 NA
4  9 NA  k
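Note that list2DF() requires R >= 4.0.0. Each Map() step just turns one sublist into a one-row data.frame tagged with its position, e.g.:
cbind(id = 1, list2DF(mylist[[1]]))
#   id a b
# 1  1 1 2
The merge(..., all = TRUE) then outer-joins these one-row frames, filling the missing columns with NA.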
You can do something like the following:
mylist <- list(list(a=1,b=2), list(a=3), list(b=5), list(a=9, z='k'))
convert <- function(mylist){
  col_names <- NULL
  # get all the unique names and create the df
  for(i in 1:length(mylist)){
    col_names <- c(col_names, names(mylist[[i]]))
  }
  col_names <- unique(col_names)
  df <- data.frame(matrix(ncol = length(col_names),
                          nrow = length(mylist)))
  colnames(df) <- col_names
  # join data to row in df
  for(i in 1:length(mylist)){
    for(j in 1:length(mylist[[i]])){
      df[i, names(mylist[[i]])[j]] <- mylist[[i]][names(mylist[[i]])[j]]
    }
  }
  return(df)
}
df <- convert(mylist)
> df
   a  b    z
1  1  2 <NA>
2  3 NA <NA>
3 NA  5 <NA>
4  9 NA    k
I've got a solution. Note this only uses the magrittr pipe, which could be exchanged for the native pipe, etc.
mylist %>%
  #' first, ensure that the 2nd level is flat,
  lapply(. %>% lapply(FUN = unlist, recursive = FALSE)) %>%
  #' replace missing vars with `NA`
  lapply(function(x, vars) {
    x[vars[!vars %in% names(x)]] <- NA
    x
  }, vars = {.} %>% unlist() %>% names() %>% unique()) %>%
  do.call(what = rbind) %>%
  #' do nothing
  identity()
The {.} is meant to define and then evaluate the function formed by unlist followed by names. Otherwise, . %>% unlist() %>% names() merely defines the function and does not evaluate it on the input.
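A quick way to see the difference (using magrittr):
library(magrittr)
fs <- . %>% unlist() %>% names()  # defines a functional sequence, i.e. a function
class(fs)
# [1] "fseq" "function"
fs(list(a = 1, b = 2))            # it must be called to produce a value
# [1] "a" "b"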

Rollapply with expanding window in R

Let's say I have a simple toy vector in R like:
x = seq(1:10);x
[1] 1 2 3 4 5 6 7 8 9 10
I want to use the rollapply function from the zoo package, but in a different way. rollapply computes a function over a vector x, with the width argument giving a rolling window. Instead of rolling, I want the window to be expanding. There are similar questions here and here, but they don't help with my problem.
For example, I want to calculate the sum of the first 5 observations of vector x and then keep expanding the window by 2.
So I tried:
rollapplyr(x, seq_along(x) ,sum,by=2,partial = 5,fill=NA)
[1] NA NA NA NA 15 21 28 36 45 55
or, replacing the NAs:
na.locf0(rollapplyr(x, 5 ,sum,by=2,partial = 5,fill=NA))
[1] NA NA NA NA 15 15 25 25 35 35
But what I ideally want as a result is:
[1] NA NA NA NA 15 15 28 28 45 45
Imagine that my dataset is huge (contains 2500 time series observations) and the function is some econometric - statistical model not a simple one like the sum that I use here.
How can I do it? Any help?
x <- seq(10)
expandapply <- function(x, start, by, FUN){
  # set points to apply function up to
  checkpoints <- seq(start, length(x), by)
  # apply function to all windows
  vals <- sapply(checkpoints, function(i) FUN(x[seq(i)]))
  # fill in numeric vector at these points (assumes output is numeric)
  out <- replace(rep(NA_real_, length(x)), checkpoints, vals)
  # forward-fill the gaps
  zoo::na.locf(out, na.rm = FALSE)
}
expandapply(x, start = 5, by = 2, FUN = sum)
#> [1] NA NA NA NA 15 15 28 28 45 45
Created on 2022-03-13 by the reprex package (v2.0.1)
Define nonNA as the positions which should not be NA. You can change x and nonNA to whatever you need.
Then assign to w a vector of widths, using zero for those components which are to be NA. Finally, apply na.locf0.
(The two extreme cases: if nonNA is seq_along(x), so that no elements are NA'd out, this is the same as rollapplyr(x, seq_along(x), sum); if nonNA is c(), so that there are no non-NA positions, it returns all NAs.)
library(zoo)
x <- 1:10
nonNA <- seq(5, length(x), 2)
w <- ifelse(seq_along(x) %in% nonNA, seq_along(x), 0)
na.locf0(rollapplyr(x, w, function(x) if (length(x)) sum(x) else NA, fill=NA))
## [1] NA NA NA NA 15 15 28 28 45 45
Another way is to use a list for the width= argument of rollapply whose components contain the offsets. x and nonNA are as above.
L <- lapply(seq_along(x), function(x) if (x %in% nonNA) -seq(x-1, 0))
na.locf0(rollapplyr(x, L, sum, fill = NA))
## [1] NA NA NA NA 15 15 28 28 45 45
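Inspecting L shows the structure: NULL where the output should stay NA, and offsets reaching back to the start elsewhere:
L[5:7]
## [[1]]
## [1] -4 -3 -2 -1  0
##
## [[2]]
## NULL
##
## [[3]]
## [1] -6 -5 -4 -3 -2 -1  0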
Update
Simplified solution and added second approach.

Subtract rows with numeric values and ignore NAs

I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column Iterate over a column ignoring but retaining NA values in R, and for multiple column operations (e.g. Summarizing multiple columns with dplyr?).
However, I haven't managed to combine the approaches suggested on SO (i.e. apply diff over multiple columns and ignore the NAs).
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
That's how it works for a single column:
diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once
And that's how I apply a diff function over several columns with lapply:
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd appreciate any suggestion (base, data.table, dplyr, ...) on how to work a valid !is.na or similar statement into this second line of code.
Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
  which_nna <- which(!is.na(x))
  out <- rep(NA_integer_, length(x))
  out[which_nna] <- x[which_nna] - shift(x[which_nna])
  out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
#     x  y  z lag_diff_x lag_diff_y lag_diff_z
# 1:  1 NA  6         NA         NA         NA
# 2:  2  4  2          1         NA         -4
# 3:  3  5  7          1          1          5
# 4: NA  6 14         NA          1          7
# 5: NA NA 20         NA         NA          6
# 6:  9 15 NA          6          9         NA
# 7:  8 14 NA         -1         -1         NA
# 8:  7 13  2         -1         -1        -18
So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
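Since the question also welcomes dplyr solutions, here is a sketch of the same idea with across(), reusing lag_d from above (data.table is still needed for shift()):
library(dplyr)
# adds one lag-diff column per input column, keeping the originals
mutate(df, across(everything(), lag_d, .names = "lag_diff_{.col}"))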

Using apply() but getting class list answer

I have a series of columns in a data.frame of which I'd like to get the last value, excluding any NAs. The function I'm using to get this done is
last_value <- function(x) tail(x[!is.na(x)], 1)
I'm using apply() to work this function across the 13 columns, for each observation (by row).
df$LastVal<-apply(df[,c(116, 561, 1006, 1451, 1896, 2341, 2786, 3231,
3676, 4121, 4566, 5011, 5456)], 1, FUN=last_value)
My problem is that the output comes out as a list of 5336 (total observations), instead of just a vector of the last values by row. The answers seem to be there but again, in list form. I've used this function before and it's worked fine. When I str() my columns, they're all integers.
Could this function get tripped up if there are no values and only NAs?
I should add that when I unlist() the new variable, I get an error that says "replacement has 4649 rows, data has 5336", so I do think this might have something to do with NAs.
First, you need to see what last_value, as you have defined it, returns for a row of all NA values.
last_value <- function(x) tail(x[!is.na(x)], 1)
df <- matrix(1:24, 4)
df[2, ] <- NA
df <- as.data.frame(df)
apply(df, 1, last_value)
#[[1]]
#V6
#21
#
#[[2]]
#named integer(0)
#
#[[3]]
#V6
#23
#
#[[4]]
#V6
#24
The problem is that the second member of this list is of length zero. This means that unlist will not solve the problem.
You have to test for a value of length zero.
last_value <- function(x) {
  y <- tail(x[!is.na(x)], 1)
  if(length(y) == 0) NA else y
}
apply(df, 1, last_value)
#[1] 21 NA 23 24
You could use your function inside a column selection.
Example
df <- as.data.frame(matrix(1:12, 3, 4))
> df
  V1 V2 V3 V4
1  1  4  7 10
2  2  5  8 11
3  3  6  9 12
last_value <- function(x) tail(x[!is.na(x)], 1)
> df[, last_value(c(3, 4))] # selects last column
[1] 10 11 12
Test with NA.
df[2, 4] <- NA
> df[, last_value(c(3, 4))]
[1] 10 NA 12
If you're in need of an apply() approach, use @Rui Barradas' well-explained answer. In case you depend on speed, consider this benchmark of both solutions:
Unit: microseconds
                      expr     min       lq      mean  median       uq     max neval cld
  apply(df, 1, last_value) 166.095 172.6005 182.09241 177.449 188.2925 257.179   100   b
 df[, last_value(c(3, 4))]  32.147  33.4230  36.12764  34.699  35.5920 131.396   100   a
As an aside: for column-wise application, use sapply().
> sapply(df[, c(3, 4)], FUN=last_value)
V3 V4
9 12
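If you want the result type checked as well, vapply() is a stricter drop-in for sapply() here (using last_value as defined above):
vapply(df[, c(3, 4)], FUN = last_value, FUN.VALUE = numeric(1))
# V3 V4
#  9 12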

How to combine two columns of a data-frame with missing data? [duplicate]

This question already has answers here:
How to implement coalesce efficiently in R
(9 answers)
Coalesce two string columns with alternating missing values to one
(7 answers)
Closed 5 years ago.
This is an extension of this earlier question. How can I combine two columns of a data frame as
data <- data.frame('a' = c('A','B','C','D','E'),
                   'x' = c("t",2,NA,NA,NA),
                   'y' = c(NA,NA,NA,4,"r"))
displayed as
'a' 'x' 'y'
A t NA
B 2 NA
C NA NA
D NA 4
E NA r
to get
'a' 'mycol'
A t
B 2
C NA
D 4
E r
I tried this
cbind(data[1], mycol = na.omit(unlist(data[-1])))
But it obviously doesn't keep the NA row.
You could do it by using ifelse, like this:
data$mycol <- ifelse(!is.na(data$x), data$x, data$y)
> data
##   a  x  y mycol
## 1 A  1 NA     1
## 2 B  2 NA     2
## 3 C NA NA    NA
## 4 D NA  4     4
## 5 E NA  5     5
Going with your logic, you can do following:
cbind(data[1], mycol = unlist(apply(data[2:3], 1, function(i) ifelse(
  any(!is.na(i)),  # the row has at least one non-NA value
  na.omit(i),
  NA)
)))
#   a mycol
# 1 A     1
# 2 B     2
# 3 C    NA
# 4 D     4
# 5 E     5
This has been addressed here indirectly. Here is a simple solution based on that, using dplyr's coalesce():
library(dplyr)
data$mycol <- coalesce(data$x, data$y)
Extending the answer to any number of columns, and using the neat max.col() function I've discovered thanks to this question:
coalesce <- function(value_matrix) {
  value_matrix <- as.matrix(value_matrix)
  first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
  indices <- cbind(
    row = seq_len(nrow(value_matrix)),
    col = first_non_missing
  )
  value_matrix[indices]
}
data$mycol <- coalesce(data[, c('x', 'y')])
data
#   a  x  y mycol
# 1 A  1 NA     1
# 2 B  2 NA     2
# 3 C NA NA    NA
# 4 D NA  4     4
# 5 E NA  5     5
max.col(..., ties.method = "first") returns, for each row, the index of the first column with the maximum value. Since we're using it on a logical matrix, the max is usually TRUE. So we'll get the first non-NA value for each row. If the entire row is NA, then we'll get an NA value as desired.
After that, the function uses a matrix of row-column indices to subset the values.
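A tiny trace of that indexing on a 2 x 2 example:
m <- matrix(c(NA, 1, 2, NA), nrow = 2)     # row 1: NA 2, row 2: 1 NA
max.col(!is.na(m), ties.method = "first")
# [1] 2 1
m[cbind(1:2, c(2, 1))]                     # row/column index pairs pick one cell per row
# [1] 2 1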
Edit
In comparison to mrip's coalesce, my max.col is slower when there are a few long columns, but faster when there are many short columns.
coalesce_reduce <- function(...) {
  Reduce(function(x, y) {
    i <- which(is.na(x))
    x[i] <- y[i]
    x
  },
  list(...))
}
coalesce_maxcol <- function(...) {
  value_matrix <- cbind(...)
  first_non_missing <- max.col(!is.na(value_matrix), ties.method = "first")
  indices <- cbind(
    row = seq_len(nrow(value_matrix)),
    col = first_non_missing
  )
  value_matrix[indices]
}
set.seed(100)
wide <- replicate(
  1000,
  {sample(c(NA, 1:10), 10, replace = TRUE)},
  simplify = FALSE
)
long <- replicate(
  10,
  {sample(c(NA, 1:10), 1000, replace = TRUE)},
  simplify = FALSE
)
library(microbenchmark)
microbenchmark(
  do.call(coalesce_reduce, wide),
  do.call(coalesce_maxcol, wide),
  do.call(coalesce_reduce, long),
  do.call(coalesce_maxcol, long)
)
# Unit: microseconds
# expr min lq mean median uq max neval
# do.call(coalesce_reduce, wide) 1879.460 1953.5695 2136.09954 2007.303 2152.654 5284.583 100
# do.call(coalesce_maxcol, wide) 403.604 423.5280 490.40797 433.641 456.583 2543.580 100
# do.call(coalesce_reduce, long) 36.829 41.5085 45.75875 43.471 46.942 79.393 100
# do.call(coalesce_maxcol, long) 80.903 88.1475 175.79337 92.374 101.581 3438.329 100
