I have several data frames containing 18 columns with approx. 50000 rows. Each row entry represents a measurement at a specific site (= column), and the data contain NA values.
I need to subtract the consecutive rows per column (e.g. row(i+1)-row(i)) to detect threshold values, but I need to ignore (and retain) the NAs, so that only the entries with numeric values are subtracted from each other.
I found very helpful posts with data.table solutions for a single column (Iterate over a column ignoring but retaining NA values in R) and for operations over multiple columns (e.g. Summarizing multiple columns with dplyr?).
However, I haven't managed to combine the approaches suggested there (i.e. apply diff over multiple columns while ignoring, and retaining, the NAs).
Here's an example df for illustration and a solution I tried:
library(data.table)
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
That's how it works for a single column (with df as a data.table):
diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but for more columns at once
And this is how I apply a diff function over several columns with lapply:
diff_all <- setDT(df)[,lapply(.SD, diff)] # not exactly what I want because NAs are not ignored and the difference between numeric values is not calculated
I'd appreciate any suggestion (base, data.table, dplyr, ... solutions) on how to implement a valid !is.na or similar statement in this second line of code.
Defining a helper function makes things a bit cleaner:
lag_diff <- function(x) {
  which_nna <- which(!is.na(x))  # positions of the non-NA values
  out <- rep(NA_integer_, length(x))
  # difference each non-NA value from the previous non-NA value
  out[which_nna] <- x[which_nna] - shift(x[which_nna])
  out
}
cols <- c("x", "y", "z")
setDT(df)
df[, paste0("lag_diff_", cols) := lapply(.SD, lag_diff), .SDcols = cols]
Result:
# x y z lag_diff_x lag_diff_y lag_diff_z
# 1: 1 NA 6 NA NA NA
# 2: 2 4 2 1 NA -4
# 3: 3 5 7 1 1 5
# 4: NA 6 14 NA 1 7
# 5: NA NA 20 NA NA 6
# 6: 9 15 NA 6 9 NA
# 7: 8 14 NA -1 -1 NA
# 8: 7 13 2 -1 -1 -18
So you are looking for:
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
setDT(df)
# diff_x <- df[!is.na(x), lag_diff := x - shift(x)] # actually what I want, but
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
df[, lapply(.SD, lag_d)]
or
library("data.table")
df <- data.frame(x=c(1:3,NA,NA,9:7),y=c(NA,4:6, NA,15:13), z=c(6,2,7,14,20, NA, NA, 2))
lag_d <- function(x) { y <- x[!is.na(x)]; x[!is.na(x)] <- y - shift(y); x }
as.data.frame(lapply(df, lag_d))
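For completeness, here is a dplyr sketch of the same idea (not from the original answers; lag_diff is redefined here with dplyr::lag instead of data.table::shift):
library(dplyr)
df <- data.frame(x = c(1:3, NA, NA, 9:7),
                 y = c(NA, 4:6, NA, 15:13),
                 z = c(6, 2, 7, 14, 20, NA, NA, 2))
lag_diff <- function(x) {
  nna <- which(!is.na(x))                   # positions of the non-NA values
  out <- rep(NA_real_, length(x))
  out[nna] <- x[nna] - dplyr::lag(x[nna])   # diff over non-NA values only
  out
}
df %>% mutate(across(c(x, y, z), lag_diff, .names = "lag_diff_{.col}"))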
I am currently trying to find unique elements between two columns of a data frame and write these to a new final data frame.
This is my code, which works fine and creates a result that matches my expectation.
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A=rep(NA,n), B=rep(NA,n))
for (element in unique_to_a){
out[element, "A"] = element
}
for (element in unique_to_b){
out[element, "B"] = element
}
out
The problem is that it is very slow, because the real data contains 100,000s of rows. I am quite sure it is because of the repeated indexing I am doing in the for loops, and I am sure there is a quicker, vectorized way, but I don't see it...
Any ideas on how to speed up the operation are much appreciated.
Cheers!
Didn't compare the speed but at least this is more concise:
elements <- with(df, list(setdiff(a, b), setdiff(b, a)))
data.frame(sapply(elements, \(x) replace(rep(NA, max(unlist(elements))), x, x)))
# X1 X2
# 1 NA NA
# 2 NA NA
# 3 NA 3
# 4 NA NA
# 5 NA NA
# 6 NA NA
# 7 NA NA
# 8 NA NA
# 9 NA NA
# 10 NA NA
# 11 11 NA
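For comparison, a fully vectorized sketch of the approach from the question (same variable names as the OP; index assignment replaces both for loops):
set.seed(42)
df <- data.frame(a = sample(1:15, 10),
                 b = sample(1:15, 10))
unique_to_a <- df$a[!(df$a %in% df$b)]
unique_to_b <- df$b[!(df$b %in% df$a)]
n <- max(c(unique_to_a, unique_to_b))
out <- data.frame(A = rep(NA, n), B = rep(NA, n))
# vectorized index assignment instead of looping element by element
out$A[unique_to_a] <- unique_to_a
out$B[unique_to_b] <- unique_to_b
out
This does the same work per element but avoids repeated data.frame indexing inside a loop, which is where the original version spends its time.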
I want to create multiple columns in a dataframe that each calculate a different value based on values from an existing column.
Say I have the following dataframe:
date <- c('1','2','3','4','5')
close <- c('10','20','15','13','19')
test_df <- data.frame(date,close)
I want to create a new column that does the following operation with dplyr:
test_df %>%
mutate(logret = log(close / lag(close, n=1)))
However I would like to create a new column for multiple values of n such that I have columns:
logret1 for n=1,
logret2 for n=2,
logret3 for n=3
etc...
I've used the function seq(from=1, to=5, by=1) as an example to get a vector of numbers to replace n with. I've tried to create a for loop around the mutate function:
seq2 <- seq(from=1, to=5, by=1)
for (number in seq2){
new_df <- test_df %>%
mutate(logret = log(close/lag(close, n=seq2)))
}
However I get the error:
Error: Problem with `mutate()` input `logret`.
x `n` must be a nonnegative integer scalar, not a double vector of length 5.
i Input `logret` is `log(close2/lag(close2, n = seq2))`.
I realise I can't pass in a vector for n, however I am stuck on how to proceed.
Any help would be much appreciated, Thanks.
You can use purrr's map_dfc to add new columns :
library(dplyr)
library(purrr)
n <- 3
bind_cols(test_df, map_dfc(1:n, ~test_df %>%
transmute(!!paste0('logret', .x) := log(close / lag(close, n=.x)))))
# date close logret1 logret2 logret3
#1 1 10 NA NA NA
#2 2 20 0.6931472 NA NA
#3 3 15 -0.2876821 0.4054651 NA
#4 4 13 -0.1431008 -0.4307829 0.26236426
#5 5 19 0.3794896 0.2363888 -0.05129329
data
test_df <- data.frame(date,close)
test_df <- type.convert(test_df, as.is = TRUE) # the question builds close as character; convert to numeric
You can use data.table. It's an R package that provides an enhanced version of data.frame. This is an awesome resource to get started: https://www.machinelearningplus.com/data-manipulation/datatable-in-r-complete-guide/
library(data.table)
#Create data.table; close is character in the question's data, so convert it for arithmetic
test_dt <- data.table(date, close = as.numeric(close))
#Define the new cols names
logret_cols <- paste0('logret', 1:3)
#Create new columns
test_dt[, (logret_cols) := lapply(1:3, function(n) log(close / shift(close, n)))] # shift() is data.table's lag
test_dt
# date close logret1 logret2 logret3
#1: 1 10 NA NA NA
#2: 2 20 0.6931472 NA NA
#3: 3 15 -0.2876821 0.4054651 NA
#4: 4 13 -0.1431008 -0.4307829 0.26236426
#5: 5 19 0.3794896 0.2363888 -0.05129329
data.table also deals with memory very efficiently. If you will be working with large amounts of data, take a look at these benchmarks; they are impressive:
https://h2oai.github.io/db-benchmark/
EDIT
You can even do it with a mix of data.table and purrr. Here's an example using the function purrr::map()
test_dt[, (logret_cols) := map(1:3, ~ log(close / shift(close, n = .x)))]
test_dt
# date close logret1 logret2 logret3
#1: 1 10 NA NA NA
#2: 2 20 0.6931472 NA NA
#3: 3 15 -0.2876821 0.4054651 NA
#4: 4 13 -0.1431008 -0.4307829 0.26236426
#5: 5 19 0.3794896 0.2363888 -0.05129329
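For reference, the loop from the question can also be fixed directly: the error came from passing the whole vector seq2 as n, and each iteration also needs its own column name. A minimal sketch (test_df rebuilt with numeric close):
library(dplyr)
test_df <- data.frame(date = 1:5, close = c(10, 20, 15, 13, 19))
for (n in 1:3) {
  # scalar n per iteration; one new column per lag
  test_df[[paste0("logret", n)]] <- log(test_df$close / dplyr::lag(test_df$close, n = n))
}
test_df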
I have a 640 x 2500 dataframe with numeric values and several NA values. My goal is to find a minimum of 75 consecutive NA values in each row. For each such run, I want to replace the previous and following 50 cells with NA values too.
Here's a scaled down example of one row:
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
# run of four NA: ^ ^ ^ ^
I want to detect the run of four consecutive NA, and then replace three values before and three values after the run with NA:
c(1, 3, 4, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2, 4, 3)
# ^ ^ ^ ^ ^ ^
I have tried to first identify the consecutive NAs with rle, but running rle(is.na(df)) gives the error "'x' must be a vector of an atomic type". This occurs even when I select a single row.
Unfortunately, I also do not know what the next steps would be to convert the previous and following 50 cells to NA.
Would highly appreciate any help on this, thanks in advance.
Because you comment that in your data "some [rows] begin and end with several NAs", hopefully this better represents the real data:
A B C D E F G H I J
1 1 2 3 NA NA 6 7 8 NA 10
2 1 NA NA NA 5 6 7 NA NA NA
3 1 2 3 4 NA NA NA 8 9 10
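For reproducibility, the example can be rebuilt from the printout above like this (a reconstruction, not part of the original answer):
d <- data.frame(
  A = c(1, 1, 1),
  B = c(2, NA, 2),
  C = c(3, NA, 3),
  D = c(NA, NA, 4),
  E = c(NA, 5, NA),
  F = c(6, 6, NA),
  G = c(7, 7, NA),
  H = c(8, NA, 8),
  I = c(NA, NA, 9),
  J = c(10, NA, 10)
)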
Let's assume that the minimum run length of NA to be expanded with NA is 2, and that two values before and two values after the run should be replaced with NA. In this example, row 2 would represent the case you mentioned in comment.
First some data wrangling. I prefer to work with a data.table in long format. With data.table we have access to the useful special symbols .I and .N, and can easily create run IDs with rleid.
# convert data.frame to data.table
library(data.table)
setDT(d)
# set minimum length of runs to be expanded
len = 2L
# set number of values to replace on each side of run
n = 2L
# number of columns of original data (for truncation of indices)
nc = ncol(d)
# create a row index to keep track of the original rows in the long format
d[ , ri := 1:.N]
# melt from wide to long format
d2 = melt(d, id.vars = "ri")
# order by row index
setorder(d2, ri)
Now the actual calculations on the runs and their indices:
d2[
# check if the run is an "NA run" and has sufficient length
d2[ , if(anyNA(value) & .N >= len){
# get indices before and after run, where values should be changed to NA
ix = c(.I[1] - n:1L, .I[.N] + 1L:n)
# truncate indices to keep them within (original) rows
ix[ix >= 1 + (ri - 1) * nc & ix <= nc * ri]},
# perform the calculation by row index and run index
# grab replacement indices
by = .(ri, rleid(is.na(value)))]$V1,
# at replacement indices, set value to NA
value := NA]
If desired, cast back to wide format:
dcast(d2, ri ~ variable, value.var = "value")
# ri A B C D E F G H I J
# 1: 1 1 NA NA NA NA NA NA 8 NA 10
# 2: 2 NA NA NA NA NA NA NA NA NA NA
# 3: 3 1 2 NA NA NA NA NA NA NA 10
Type coercion worked for me:
rle(as.logical(is.na(x[MyRow, ])))
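Building on that hint, here is a self-contained rle sketch for one row, using the scaled-down constants from the question (len and n are illustrative names; set them to 75 and 50 for the real data):
x <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
len <- 4L # minimum NA run length to expand
n   <- 3L # cells to blank on each side of a qualifying run
r <- rle(is.na(x))
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1L
for (i in which(r$values & r$lengths >= len)) {
  # indices n before and n after the run, truncated to the vector bounds
  ix <- c(starts[i] - n:1, ends[i] + 1:n)
  x[ix[ix >= 1 & ix <= length(x)]] <- NA
}
x
# [1]  1  3  4 NA NA NA NA NA NA NA NA NA NA  2  4  3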
Here is my solution for this. I wonder if there is a tidier solution than mine though.
library(data.table)
library(dplyr) # for %>%, as_tibble() and slice() used below
df <- matrix(nrow = 1,ncol = 16)
df[1,] <- c(1, 3, 4, 5, 4, 3, NA, NA, NA, NA, 6, 9, 3, 2, 4, 3)
df <- df %>%
as.data.table() # dataset created
# A function to do what you need
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths == 4)]))] # Locate the position of those NAs that are repeated exactly 4 times
if(length(NAs) > 0){ # Check if a stretch of 4 NAs was found
Vector[seq(NAs[1]-3,
NAs[1]-1,1)] <- NA # this part deals with the 3 positions occuring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+3,1)] <- NA # this part deals with the 3 positions occuring after the last NA
}
Vector
}
> df # the original dataset
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 5 4 3 NA NA NA NA 6 9 3 2 4 3
# the transformed dataset
apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16
1: 1 3 4 NA NA NA NA NA NA NA NA NA NA 2 4 3
As an aside, the speed is quite good for a hypothetical dataframe sized 640*2500 where a stretch of 75 or more NAs have to be located and the 50 values before and after must be replaced with an NA.
df <- matrix(nrow = 640,ncol = 2500)
for(i in 1:nrow(df)){
df[i,] <- c(1:100,rep(NA,75),rep(1,2325))
}
NA_replacer <- function(x){
Vector <- unlist(x) # pull the values into a vector
NAs <- which(is.na(Vector)) # locate the positions of the NAs
NAs_Position_1 <- cumsum(c(1, diff(NAs) - 1)) # Find those that are in sequential order
NAs_Position_2 <- rle(NAs_Position_1) # Find their values
NAs <- NAs[which(
NAs_Position_1 == with(NAs_Position_2,
values[which(
lengths >= 75)]))] # Locate the positions of those NAs that occur in stretches of 75 or more
if(length(NAs) > 0){ # Check that such a stretch was found
Vector[seq(NAs[1]-50,
NAs[1]-1,1)] <- NA # this part deals with the 50 positions occuring before the first NA
Vector[seq(NAs[length(NAs)]+1,
NAs[length(NAs)]+50,1)] <- NA # this part deals with the 50 positions occuring after the last NA
}
Vector
}
# Check how many NAs are present in the first row of the dataset prior to applying the function
which(is.na(df %>%
as_tibble() %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 75
df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose()
# Check how many NAs are present in the first row post applying the function
which(is.na(df %>%
slice(1) %>%
unlist())) %>% # run the code till here to get the indices of the NAs
length()
[1] 175
system.time(df <- apply(df, 1, function(x) NA_replacer(x)) %>%
as.data.table() %>%
data.table::transpose())
user system elapsed
0.216 0.002 0.220
I would like to do a last observation carried forward for a variable, but only up to 2 observations. That is, for gaps of 3 or more NAs, I would only carry the last observation forward into the next 2 observations and leave the rest as NA.
If I do this with the zoo::na.locf, the maxgap parameter implies that if the gap is larger than 2, no NA is replaced. Not even the last 2. Is there any alternative?
x <- c(NA,3,4,5,6,NA,NA,NA,7,8)
zoo::na.locf(x, maxgap = 2) # Doesn't replace the first 2 NAs after the 6, as the gap of NAs is 3.
Desired_output <- c(NA,3,4,5,6,6,6,NA,7,8)
First apply na.locf0 with maxgap = 2, giving x0, and define a grouping variable g using rleid from the data.table package. For each such group, use ave to apply keeper, which replaces a group that is all NA with c(1, 1, NA, ..., NA) and otherwise outputs all 1s. Finally, multiply na.locf0(x) by that.
library(data.table)
library(zoo)
mg <- 2
x0 <- na.locf0(x, maxgap = mg)
g <- rleid(is.na(x0))
keeper <- function(x) if (all(is.na(x))) ifelse(seq_along(x) <= mg, 1, NA) else 1
na.locf0(x) * ave(x0, g, FUN = keeper)
## [1] NA 3 4 5 6 6 6 NA 7 8
A solution using base R:
ave(x, cumsum(!is.na(x)), FUN = function(i){ i[1:pmin(length(i), 3)] <- i[1]; i })
# [1] NA 3 4 5 6 6 6 NA 7 8
cumsum(!is.na(x)) groups each run of NAs with most recent non-NA value.
function(i){ i[1:pmin(length(i), 3)] <- i[1]; i } transforms the first two NAs of each group into the leading non-NA value of this group.
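For comparison, an explicit base-R rle loop that makes the "fill at most two" rule visible (a sketch; maxfill is an illustrative name):
x <- c(NA, 3, 4, 5, 6, NA, NA, NA, 7, 8)
maxfill <- 2L
r <- rle(is.na(x))
ends   <- cumsum(r$lengths)
starts <- ends - r$lengths + 1L
for (i in which(r$values & starts > 1)) {
  # carry the preceding value into at most maxfill leading NAs of the run
  k <- min(maxfill, r$lengths[i])
  x[starts[i]:(starts[i] + k - 1L)] <- x[starts[i] - 1L]
}
x
# [1] NA  3  4  5  6  6  6 NA  7  8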
I'd just like to understand a (for me) weird behavior of the function rowSums. Imagine I have this super simple data frame:
a = c(NA, NA,3)
b = c(2,NA,2)
df = data.frame(a,b)
df
a b
1 NA 2
2 NA NA
3 3 2
and now I want a third column that is the sum of the other two. I cannot simply use + because of the NAs:
df$c <- df$a + df$b
df
a b c
1 NA 2 NA
2 NA NA NA
3 3 2 5
but if I use rowSums, the rows that are all NA are calculated as 0, while if there is only one NA everything works fine:
df$d <- rowSums(df, na.rm=T)
df
a b c d
1 NA 2 NA 2
2 NA NA NA 0
3 3 2 5 10
am I missing something?
Thanks to all
One option with rowSums would be to get the row sums with na.rm = TRUE and multiply them by NA^!rowSums(!is.na(df)), which is NA for rows that are all NA and 1 otherwise:
rowSums(df, na.rm = TRUE) * NA^!rowSums(!is.na(df))
#[1] 2 NA 10
Because
sum(numeric(0))
# 0
Once you use na.rm = TRUE in rowSums, the second row becomes numeric(0); taking its sum gives 0.
If you want to retain NA for all-NA rows, it takes two stages. I recommend writing a small function for this purpose:
my_rowSums <- function(x) {
if (is.data.frame(x)) x <- as.matrix(x)
z <- base::rowSums(x, na.rm = TRUE)
z[!base::rowSums(!is.na(x))] <- NA
z
}
my_rowSums(df)
# [1] 2 NA 10
This can be particularly useful if the input x is a data frame (as in your case). base::rowSums first checks whether the input is a matrix; if it gets a data frame, it converts it into a matrix first, and this type conversion is in fact more costly than the actual row sum computation. Note that we call base::rowSums twice, so to reduce the conversion overhead we should make sure x is a matrix beforehand.
For #akrun's "hacking" answer, I suggest:
akrun_rowSums <- function (x) {
if (is.data.frame(x)) x <- as.matrix(x)
rowSums(x, na.rm=TRUE) *NA^!rowSums(!is.na(x))
}
akrun_rowSums(df)
# [1] 2 NA 10