Issue with local variables in an R custom function

I've got a dataset
>view(interval)
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 2 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
>dput(interval)
structure(list(V1 = c(NA, 2, 3, 4, NA),
V2 = c(1, 2, NA, 2, 5),
V3 = c(2, 3, 1, 2, 1), ID = 1:5), row.names = c(NA, -5L), class = "data.frame")
I would like to extract, for every NA, the previous non-NA value in its column (or the next one, if the NA is in the first row) and store it as a local variable in a custom function, because I have to perform other operations on every row based on this value (which should change for every row I'm applying the function to).
I've written this function to print the local variables, but when I apply it the output is not what I want:
myFunction<- function(x){
position <- as.data.frame(which(is.na(interval), arr.ind=TRUE))
tempVar <- ifelse(interval$ID == 1, interval[position$row+1,
position$col], interval[position$row-1, position$col])
return(tempVar)
}
I was expecting to get something like this
# [1] 2
# [2] 2
# [3] 4
But I get something pretty messed up instead.

Here's attempt number 1:
dat <- read.table(header=TRUE, text='
V1 V2 V3 ID
NA 1 2 1
2 2 3 2
3 NA 1 3
4 2 2 4
NA 5 1 5')
myfunc1 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]), , drop = FALSE]  # drop = FALSE keeps a matrix even with a single NA
# catch first-row NA
ind[,1] <- ifelse(ind[,1] == 1L, 2L, ind[,1] - 1L)
x[ind]
}
myfunc1(dat)
# [1] 2 2 4
The problem with this is when there is a second "stacked" NA:
dat2 <- dat
dat2[2,1] <- NA
dat2
# V1 V2 V3 ID
# 1 NA 1 2 1
# 2 NA 2 3 2
# 3 3 NA 1 3
# 4 4 2 2 4
# 5 NA 5 1 5
myfunc1(dat2)
# [1] NA NA 2 4
One fix/safeguard against this is to use zoo::na.locf, which performs "last observation carried forward". Since the top row is a special case, we do it twice, the second time in reverse. This gives us the "nearest non-NA value in the column" (above or below, depending on where the NA sits).
library(zoo)
myfunc2 <- function(x) {
ind <- which(is.na(x), arr.ind=TRUE)
# since it appears you want them in row-first sorted order
ind <- ind[order(ind[,1], ind[,2]), , drop = FALSE]  # drop = FALSE keeps a matrix even with a single NA
# this is to guard against stacked NA
x <- apply(x, 2, zoo::na.locf, na.rm = FALSE)
# this special-case is when there are one or more NAs at the top of a column
x <- apply(x, 2, zoo::na.locf, fromLast = TRUE, na.rm = FALSE)
x[ind]
}
myfunc2(dat2)
# [1] 3 3 2 4
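If you'd rather not depend on zoo, here is a minimal base-R sketch of what the two na.locf passes do to a single column. The helper name locf_vec is hypothetical (it is not part of zoo or base R), and this is only meant to illustrate the idea, not to replace na.locf in general.

```r
# Hypothetical helper: carry the last non-NA value forward (base R only).
locf_vec <- function(v) {
  idx <- cumsum(!is.na(v))         # position of the most recent non-NA value (0 if none yet)
  c(NA, v[!is.na(v)])[idx + 1L]    # leading NAs point at the padded NA and stay NA
}

v <- c(NA, NA, 3, 4, NA)           # column V1 of dat2

forward <- locf_vec(v)                   # first pass: fill downwards
both    <- rev(locf_vec(rev(forward)))   # second pass: fill the leading NAs upwards
```

After the first pass only the leading NAs are left; the reversed second pass fills those from the first observed value below them.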

Related

How to verify if when a column is NA the other is not?

I have a dataframe with two columns. I need to check whether, wherever one column is NA, the other is not. Thanks
Edited.
I would like to know which rows of the dataframe have both columns not NA.
You can use the following code to check which rows have no NA values:
df <- data.frame(x = c(1, NA),
y = c(2, NA))
which(rowSums(is.na(df)) == 0)
Output:
[1] 1
As you can see, the first row has no NA values, i.e. both of its columns are non-NA.
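For the original wording of the question (one column is NA while the other is not), a small sketch along the same lines might look like this. The example data and the variable names both_present and exactly_one_na are made up for illustration.

```r
# Made-up example data: one row for each NA pattern
df <- data.frame(x = c(1, NA, 3, NA),
                 y = c(2, NA, NA, 4))

# Rows where both columns are non-NA
both_present <- complete.cases(df)

# Rows where exactly one of the two columns is NA
exactly_one_na <- xor(is.na(df$x), is.na(df$y))

which(both_present)     # row 1
which(exactly_one_na)   # rows 3 and 4
```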
Here's some simple code to add a column with the NA count for each row:
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)
df$NA_Count <- apply(df, 1, function(x) sum(is.na(x)))
df
x y NA_Count
1 NA 1 1
2 NA NA 2
3 1 NA 1
4 1 NA 1
5 NA NA 2
6 1 NA 1
7 1 1 0
8 1 1 0
9 1 1 0
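As a side note, the same count can probably be had without apply(), since is.na() on a data frame returns a logical matrix that rowSums() can sum directly. A minimal sketch (the set.seed() call is my own addition, just to make the sample reproducible):

```r
set.seed(42)  # assumption: any seed works, it only fixes the random sample
x <- sample(c(1, NA), 25, replace = TRUE)
y <- sample(c(1, NA), 25, replace = TRUE)
df <- data.frame(x, y)

na_by_apply   <- apply(df, 1, function(r) sum(is.na(r)))  # row-by-row version
na_by_rowsums <- rowSums(is.na(df))                       # vectorised version

df$NA_Count <- na_by_rowsums
```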

how to add a new row with extra column in R?

I was trying to add the results of a for loop to a dataframe as new rows, but I get an error when a new result has more columns than the original dataframe. How can I add such a result, so that the extra column names are added to the original dataframe as well?
e.g.
original dataframe:
A B C
x1 1 1 1
x2 2 2 2
x3 3 3 3
I want to get
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
X4 4 4 4 4
I tried rbind (Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match)
and rbind_fill (Error: All inputs to rbind.fill must be data.frames)
and bind_rows (Argument 2 must have names)
In base R, this can be done by creating a new column 'D' filled with NA and then assigning a new row filled with 4.
df1$D <- NA
df1['x4', ] <- 4
-output
> df1
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
Or in a single line
rbind(cbind(df1, D = NA), x4 = 4)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
x4 4 4 4 4
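If the loop can produce rows with columns you don't know in advance, the same idea can be wrapped in a small function. This is only a sketch in base R; the helper name add_row_fill is hypothetical, and it assumes each loop result is a named vector.

```r
# Hypothetical helper: append a named vector as a new row,
# creating any column that is missing on either side and filling it with NA.
add_row_fill <- function(df, new_row, row_name) {
  new_df   <- as.data.frame(as.list(new_row), row.names = row_name)
  all_cols <- union(names(df), names(new_df))
  for (nm in setdiff(all_cols, names(df)))     df[[nm]]     <- NA
  for (nm in setdiff(all_cols, names(new_df))) new_df[[nm]] <- NA
  rbind(df[all_cols], new_df[all_cols])
}

df1 <- data.frame(A = 1:3, B = 1:3, C = 1:3,
                  row.names = c("x1", "x2", "x3"))

res <- add_row_fill(df1, c(A = 4, B = 4, C = 4, D = 4), "x4")
```

Calling it once per iteration grows the data frame one row at a time, which is fine for a handful of rows; for many rows it is usually better to collect the results in a list and bind them once at the end.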
Regarding the error in bind_rows, it happens when the for loop output is not a named vector
library(dplyr)
> vec1 <- c(4, 4, 4, 4)
> bind_rows(df1, vec1)
Error: Argument 2 must have names.
Run `rlang::last_error()` to see where the error occurred.
If it is a named vector, then it should work
> vec1 <- c(A = 4, B = 4, C = 4, D = 4)
> bind_rows(df1, vec1)
A B C D
x1 1 1 1 NA
x2 2 2 2 NA
x3 3 3 3 NA
...4 4 4 4 4
data
df1 <- structure(list(A = 1:3, B = 1:3, C = 1:3),
class = "data.frame", row.names = c("x1",
"x2", "x3"))
You probably have something like this, if you list the elements of your for loop.
(l <- list(x1, x2, x3, x4, x5))
# [[1]]
# [1] 1 1 1
#
# [[2]]
# [1] 2 2 2 2
#
# [[3]]
# [1] 3 3
#
# [[4]]
# [1] 4
#
# [[5]]
# NULL
Multiple elements can be row-bound using a do.call(rbind, .) approach; your problem is how to rbind multiple elements that differ in length.
There's a `length<-` function with which you may adjust the length of a vector. To know to which length, there's another function, lengths, that gives you the lengths of each list element, where you are interested in the maximum.
I include the special case where an element is NULL (our 5th element of l); since the length of NULL cannot be changed, those elements are replaced with NA.
So altogether you may do:
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, max(lengths(l))))
# [,1] [,2] [,3] [,4]
# [1,] 1 1 1 NA
# [2,] 2 2 2 2
# [3,] 3 3 NA NA
# [4,] 4 NA NA NA
# [5,] NA NA NA NA
Or, since you probably want a data frame with pretty row and column names:
ml <- max(lengths(l))
do.call(rbind, lapply(replace(l, lengths(l) == 0L, NA), `length<-`, ml)) |>
as.data.frame() |> `dimnames<-`(list(paste0('x', 1:length(l)), LETTERS[1:ml]))
# A B C D
# x1 1 1 1 NA
# x2 2 2 2 2
# x3 3 3 NA NA
# x4 4 NA NA NA
# x5 NA NA NA NA
Note: the native pipe |> used above requires R >= 4.1.
Data:
x1 <- rep(1, 3); x2 <- rep(2, 4); x3 <- rep(3, 2); x4 <- rep(4, 1); x5 <- NULL
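In case the two building blocks are unfamiliar, here is a tiny sketch of `length<-` and lengths in isolation. The vector v and the list are made up for illustration.

```r
v <- c(3, 3)
length(v) <- 4        # same effect as v <- `length<-`(v, 4); pads with NA
v
# 3 3 NA NA

lens <- lengths(list(1:3, 1:4, NULL))   # one length per list element
lens
# 3 4 0
```

The 0 for the NULL element is why the answer swaps such elements for NA before padding.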

replace missing values using other rows only when other columns are the same in R

I guess other people have already asked about this, but I couldn't find what I'm looking for.
I want to replace NA values with the value of the row above, only when all other values are the same. Bonus point for data.table solution.
Right now, I've managed to do it only with a (very inefficient) loop.
In addition, my current code does not replace the NAs when there are two NAs in the same row.
I have a strong feeling that I'm overthinking this problem. Any ideas of making this stuff easier?
library(data.table)
ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(NA, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
desired_ex <- data.table(
id = c(1, 1, 2, 2),
attr1 = c(NA, NA, 3, 3),
attr2 = c(2, 2, NA, 3),
attr3 = c(2, 2, 2, 1),
attr4 = c(1, 1, 1, 3)
)
col_names <- paste0("attr", 1:4)
r<-1
for (r in 1:nrow(ex)) {
print(r)
to_check <- col_names[colSums(is.na(ex[r, .SD, .SDcols = col_names])) >0]
if (length(to_check) == 0) {
print("no NA- next")
next
}
for (col_check in to_check) {
.ex <- copy(ex)[seq(from = r, to = r + 1), ]
.ex[[col_check]] <- NULL
if (nrow(unique(.ex)) == 1) {
ex[[col_check]][r] <- ex[[col_check]][r + 1]
}
}
}
all.equal(ex, desired_ex)
Here is a solution which will work for an arbitrary number of rows and columns within each id, not just pairs of rows:
library(data.table)
ex[,
if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE))))) {
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N))
} else {
.SD
}, by = id]
or, more compact,
ex[, if (all(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id attr1 attr2 attr3 attr4
1: 1 NA 2 2 1
2: 1 NA 2 2 1
3: 2 3 NA 2 1
4: 2 3 3 1 3
Explanation
For each id, it is checked whether the rows fulfill the condition. If not, .SD is returned unchanged. If the condition is fulfilled, a new .SD is created by picking the first non-NA value in each column (or NA if the column is all NA) using fcoalesce() and replicating this value as many times as there are rows in .SD.
The check for the condition consists of 2 parts. First, each column in .SD is checked for whether all of its values equal its first value, ignoring any NA. Then it is checked whether this is TRUE for all columns.
Note that .SD is a data.table containing the Subset of Data for each group, excluding any columns used in by.
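For readers without data.table, the per-group logic can be sketched in base R roughly as follows. The helper names consistent, first_non_na and fill_group are hypothetical. Note one deliberate difference: this sketch compares all non-NA values of a column with each other, whereas first(x) == x compares against the first value only, so a column whose first value is NA always passes that check.

```r
ex <- data.frame(
  id    = c(1, 1, 2, 2),
  attr1 = c(NA, NA, 3, 3),
  attr2 = c(2, 2, NA, 3),
  attr3 = c(NA, 2, 2, 1),
  attr4 = c(1, 1, 1, 3)
)

# TRUE if the non-NA values of a column are all the same (or there are none)
consistent <- function(x) length(unique(x[!is.na(x)])) <= 1L

# First non-NA value of a column, or NA if the column is all NA
first_non_na <- function(x) if (all(is.na(x))) x[1] else x[!is.na(x)][1]

# Fill a group only if every column is consistent; otherwise return it unchanged
fill_group <- function(g) {
  if (all(vapply(g, consistent, logical(1)))) {
    g[] <- lapply(g, function(x) rep(first_non_na(x), nrow(g)))
  }
  g
}

res <- do.call(rbind, lapply(split(ex, ex$id), fill_group))
rownames(res) <- NULL
```

On this example, only attr3 in the first row of id 1 changes (NA becomes 2); id 2 is left alone because attr3 and attr4 differ between its rows.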
Another use case with more rows and columns
ex2 <- fread("
id foo bar baz attr4 attr5
1 NA 2 NA 1 5
1 NA 2 2 1 NA
1 NA 2 NA NA NA
2 3 NA 2 1 2
2 3 3 1 3 2
2 3 3 1 4 2
3 5 2 NA 1 3
3 NA 2 2 1 3
4 NA NA NA NA NA
")
ex2[, if (sum(unlist(lapply(.SD, \(x) all(first(x) == x, na.rm = TRUE)))) == ncol(.SD))
lapply(.SD, \(x) rep(fcoalesce(as.list(x)), .N)) else .SD, by = id]
id foo bar baz attr4 attr5
1: 1 NA 2 2 1 5
2: 1 NA 2 2 1 5
3: 1 NA 2 2 1 5
4: 2 3 NA 2 1 2
5: 2 3 3 1 3 2
6: 2 3 3 1 4 2
7: 3 5 2 2 1 3
8: 3 5 2 2 1 3
9: 4 NA NA NA NA NA
Here is an option mixing base R with data.table:
#lead the values for comparison
cols <- paste0("attr", 1L:4L)
lcols <- paste0("lead_", cols)
ex[, (lcols) := shift(.SD, -1L), id]
#check which rows fulfill the criteria
flags <- apply(ex[, ..cols] == ex[, ..lcols], 1L, all, na.rm=TRUE) &
apply(ex[, ..lcols], 1L, function(x) !all(is.na(x)))
#update those rows with values from row below
ex[(flags), (cols) :=
mapply(function(x, y) fcoalesce(x, y), mget(lcols), mget(cols), SIMPLIFY=FALSE)]
ex[, (lcols) := NULL][]
This solution assumes that there is no recursive filling, i.e. the row after next is never used to fill the current row even if the criteria are met.

Follow-up: Separate columns with constant numbers and condense them to one row in R data.frame

This question is a follow-up to my previous question. After my split.default() call below, I get a named list of data.frames called L.
Question: how can I condense to a single row each data.frame in L whose columns each consist of one constant number? (And what if I already know the names of the data.frames whose columns are constant?)
My desired output is shown further below.
r <- list(
data.frame(study.name = rep("Jacob", 6),
X = c(2,2,1,1,NA, NA),
Y = c(1,1,1,2,1,NA),
A = rep(1, 6),
B = rep(4, 6)),
data.frame(study.name = rep("Jon", 6),
X = c(1,NA,3,1,NA,NA),
G = c(1,1,1,2,NA,NA),
A = rep(3, 6),
B = rep(7, 6)))
DATA <- do.call(cbind, r)
nm1 <- Reduce(intersect, lapply(r, colnames))[-1]
L <- split.default(DATA[names(DATA) %in% nm1], names(DATA)[names(DATA) %in% nm1])
Desired output:
# $A
# A A.1
# 1 1 3
# $B
# B B.1
# 1 4 7
# $X
# X X.1
# 1 2 1
# 2 2 NA
# 3 1 3
# 4 1 1
# 5 NA NA
# 6 NA NA
Assuming that the all-NA rows should be preserved, loop over the list and, for each data.frame, keep a row if it is not a duplicate of an earlier row or if all of its elements are NA:
lapply(L, function(x) x[(rowSums(is.na(x)) == ncol(x))|!duplicated(x),])
#$A
# A A.1
#1 1 3
#$B
# B B.1
#1 4 7
#$X
# X X.1
#1 2 1
#2 2 NA
#3 1 3
#4 1 1
#5 NA NA
#6 NA NA
If we also need a check for constant value
is_constant <- function(x) length(unique(x)) == 1L
lapply(L, function(x) if(all(sapply(x, is_constant))) x[1,, drop = FALSE] else x)
#$A
# A A.1
#1 1 3
#$B
# B B.1
#1 4 7
#$X
# X X.1
#1 2 1
#2 2 NA
#3 1 3
#4 1 1
#5 NA NA
#6 NA NA
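For the case in the question where the names of the constant data.frames are already known, a minimal sketch could simply subset the list by name. Here L is rebuilt by hand so the snippet runs on its own, and const_names is a hypothetical variable holding the known names.

```r
# Stand-in for the list L produced by split.default() in the question
L <- list(
  A = data.frame(A = rep(1, 6), A.1 = rep(3, 6)),
  B = data.frame(B = rep(4, 6), B.1 = rep(7, 6)),
  X = data.frame(X = c(2, 2, 1, 1, NA, NA), X.1 = c(1, NA, 3, 1, NA, NA))
)

const_names <- c("A", "B")   # assumed to be known in advance

# Keep only the first row of the named elements; leave the rest untouched
L[const_names] <- lapply(L[const_names], function(x) x[1, , drop = FALSE])
```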

How to ignore case when using subset in R

How to ignore case when using subset function in R?
eos91corr.data <- subset(test.data,select=c(c(X,Y,Z,W,T)))
I would like to select the columns named x, y, z, w, t. What should I do?
Thanks
If you can live without the subset() function, the tolower() function may work:
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)
dat[,tolower(names(dat)) %in% c("xy","x")]
However, this will return a data.frame with the columns in the order they are in the original dataset dat: both
dat[,tolower(names(dat)) %in% c("xy","x")]
and
dat[,tolower(names(dat)) %in% c("x","xy")]
will yield the same result, although the order of the target names has been reversed.
If you want the columns in the result to be in the order of the target vector, you need to be slightly more fancy. The two following commands both return a data.frame with the columns in the order of the target vector (i.e., the results will be different, with columns switched):
dat[,sapply(c("x","xy"),FUN=function(foo)which(foo==tolower(names(dat))))]
dat[,sapply(c("xy","x"),FUN=function(foo)which(foo==tolower(names(dat))))]
You could use regular expressions with the grep function to ignore case when identifying column names to select. Once you have identified the desired column names, then you can pass these to subset.
If your data are
dat <- data.frame(xy = 1:5, x = 1:5, mm = 1:5, y = 1:5, z = 1:5,
w = 1:5, t = 1:5, r = 1:5)
# xy x mm y z w t r
# 1 1 1 1 1 1 1 1 1
# 2 2 2 2 2 2 2 2 2
# 3 3 3 3 3 3 3 3 3
# 4 4 4 4 4 4 4 4 4
# 5 5 5 5 5 5 5 5 5
Then
(selNames <- grep("^[XYZWT]$", names(dat), ignore.case = TRUE, value = TRUE))
# [1] "x" "y" "z" "w" "t"
subset(dat, select = selNames)
# x y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
EDIT: If your column names are longer than one letter, the above approach won't work too well. So, assuming you can get your desired column names in a vector, you could use the following:
upperNames <- c("XY", "Y", "Z", "W", "T")
(grepPattern <- paste0("^", upperNames, "$", collapse = "|"))
# [1] "^XY$|^Y$|^Z$|^W$|^T$"
(selNames2 <- grep(grepPattern, names(dat), ignore.case = TRUE, value = TRUE))
# [1] "xy" "y" "z" "w" "t"
subset(dat, select = selNames2)
# xy y z w t
# 1 1 1 1 1 1
# 2 2 2 2 2 2
# 3 3 3 3 3 3
# 4 4 4 4 4 4
# 5 5 5 5 5 5
The 'stringr' package is a very neat wrapper for all of this functionality, and it has an 'ignore.case' option.
Also, you may want to consider using match rather than subset.
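A rough sketch of the match idea in base R, assuming the wanted names are available as a character vector (target and idx are names made up here):

```r
dat <- data.frame(XY = 1:5, x = 1:5, mm = 1:5,
                  y = 1:5, z = 1:5, w = 1:5, t = 1:5, r = 1:5)

target <- c("X", "Y", "Z", "W", "T")   # wanted names, in any case

# Compare lower-cased names; match() returns positions in the order of target
idx <- match(tolower(target), tolower(names(dat)))

# Drop targets that were not found, and keep a data.frame even for one column
sel <- dat[, idx[!is.na(idx)], drop = FALSE]
names(sel)
# "x" "y" "z" "w" "t"
```

Because match() follows the order of target, the selected columns come back in the order of the target vector, and only exact names (ignoring case) are matched, so XY is not picked up by "X".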
