I have a large data.frame with 'staggered' data and would like to align it. What I mean is I would like to take something like
and remove the leading (top) NAs from all columns to get
I know about the na.trim function from the zoo package, but this didn't work on either the initial data.frame presented above or its transpose. For this I used, with transposed dataframe t.df,
t.df <- na.trim(t.df, sides = 'left')
This only returned an empty data.frame, and wouldn't work the way I wanted anyway since it would create vectors of different lengths. Can anyone point me to a package or function that might be more helpful?
Here is the code for my example used above:
# example of what I have
var1 <- c(1,2,3,4,5,6,7,8,9,10)
var2 <- c(6,2,4,7,3,NA,NA,NA,NA,NA)
var3 <- c(NA,NA,8,6,3,7,NA,NA,NA,NA)
var4 <- c(NA,NA,NA,NA,5,NA,2,6,2,9)
df <- data.frame(var1, var2, var3, var4)
# transpose and (unsuccessful) attempt to remove leading NAs
t.df <- t(df)
t.df <- na.trim(t.df, sides = 'left')
We can loop over the columns (lapply(..) and apply na.trim. Then, pad NAs at the end of the each of the list elements by assigning length as the maximum length from the list elements.
library(zoo)
lst <- lapply(df, na.trim)
df[] <- lapply(lst, `length<-`, max(lengths(lst)))
df
# var1 var2 var3 var4
#1 1 6 8 5
#2 2 2 6 NA
## 3 4 3 2
#4 4 7 7 6
#5 5 3 NA 2
#6 6 NA NA 9
#7 7 NA NA NA
#8 8 NA NA NA
#9 9 NA NA NA
#10 10 NA NA NA
Or as #G.Grothendieck mentioned in the comments
replace(df, TRUE, do.call("merge", lapply(lst, zoo)))
You can do with base functions:
my.na.trim <- function(x) {
r <- rle(is.na(x))
if (!r$value[1]) return(x)
x[c(((r$length[1]+1):length(x)), 1:r$length[1])]
}
df[,] <- lapply(df, my.na.trim)
df
# var1 var2 var3 var4
# 1 1 6 8 5
# 2 2 2 6 NA
# 3 3 4 3 2
# 4 4 7 7 6
# 5 5 3 NA 2
# 6 6 NA NA 9
# 7 7 NA NA NA
# 8 8 NA NA NA
# 9 9 NA NA NA
# 10 10 NA NA NA
alternative coding for the function:
my.na.trim <- function(x) {
r <- rle(is.na(x))
if (!r$value[1]) return(x)
r1 <- r$length[1]
c(tail(x, -r1), head(x, r1))
}
We can use the cbind.na() function from the qpcR package and combine it with the na.trim() function from the zoo package:
do.call(qpcR:::cbind.na, lapply(df, zoo::na.trim))
# var1 var2 var3 var4
# [1,] 1 6 8 5
# [2,] 2 2 6 NA
# [3,] 3 4 3 2
# [4,] 4 7 7 6
# [5,] 5 3 NA 2
# [6,] 6 NA NA 9
# [7,] 7 NA NA NA
# [8,] 8 NA NA NA
# [9,] 9 NA NA NA
#[10,] 10 NA NA NA
If speed is a matter you can use this data.table solution.
library(data.table)
dt_foo <- function(dt) {
shift_v <- sapply(dt, function(col) min(which(+(is.na(col)) == 0))-1)
shift_expr <- parse(text = paste0("list(", paste("shift(", names(shift_v), ", n = ", shift_v, ", type = 'lead')", collapse = ", "), ")"))
dt[, names(shift_v) := eval(shift_expr), with = F]
dt[]
}
Some benchmarking follows.
library(zoo)
library(microbenchmark)
set.seed(1)
DT <- as.data.table(matrix(sample(c(0:9L, NA), 1e8, T, prob = c(rep(.01, 10), .9)), ncol = 1000))
zoo_foo <- function(df) {
lst <- lapply(df, na.trim)
df[] <- lapply(lst, `length<-`, max(lengths(lst)))
df
}
my.na.trim <- function(x) {
r <- rle(is.na(x))
if (!r$value[1]) return(x)
x[c(((r$length[1]+1):length(x)), 1:r$length[1])]
}
microbenchmark(dt_foo(copy(DT)), zoo_foo(DT),
as.data.frame(lapply(DT, my.na.trim)), times = 10)
Unit: seconds
expr min lq mean median uq max neval cld
dt_foo(copy(DT)) 1.468749 1.618289 1.690293 1.699926 1.725534 1.893018 10 a
zoo_foo(DT) 6.493227 6.516247 6.834768 6.779045 7.190705 7.319058 10 c
as.data.frame(lapply(DT, my.na.trim)) 4.988514 5.013340 5.384399 5.385273 5.508889 6.517748 10 b
Related
I'd like to remove the NA values from my columns, merge all columns into four columns, while keeping NA's if there is not 4 values in each row.
Say I have data like this,
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
#> a b c d e f g
#> 1 1 3 NA 4 NA 1 NA
#> 2 4 NA 2 2 5 NA NA
#> 3 NA 3 NA NA 3 NA NA
#> 4 3 NA NA NA NA 4 4
My desired outcome would be,
df.desired <- data.frame('a' = c(1,4,3,3),
'b' = c(3,2,3,4),
'c' = c(4,2,NA,4),
'd' = c(1,5,NA,NA))
df.desired
#> a b c d
#> 1 1 3 4 1
#> 2 4 2 2 5
#> 3 3 3 NA NA
#> 4 3 4 4 NA
You could've probably explored a bit more on SO to tweak two answers 1 & 2.
Shifting all the Numbers with NAs
Remove the columns where you've got All NAs
Result:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df.new<-do.call(rbind,lapply(1:nrow(df),function(x) t(matrix(df[x,order(is.na(df[x,]))])) ))
colnames(df.new)<-colnames(df)
df.new
df.new[,colSums(is.na(df.new))<nrow(df.new)]
Output:
> df.new[,colSums(is.na(df.new))<nrow(df.new)]
a b c d
[1,] 1 3 4 1
[2,] 4 2 2 5
[3,] 3 3 NA NA
[4,] 3 4 4 NA
I believe there are more efficient ways, anyhow that is my try:
x00=sapply(1:nrow(df),function(x) df[x,][!is.na( df[x,])])
x01=lapply(x00,function(x) x=c(x,rep(NA,7-length(x)-1)))
x02=as.data.frame(do.call("rbind",x01))
x02 <- x02[,colSums(is.na(x02))<nrow(x02)]
I have following solution:
df <- data.frame('a' = c(1,4,NA,3),
'b' = c(3,NA,3,NA),
'c' = c(NA,2,NA,NA),
'd' = c(4,2,NA,NA),
'e'= c(NA,5,3,NA),
'f'= c(1,NA,NA,4),
'g'= c(NA,NA,NA,4))
df
x <-list()
for(i in 1:nrow(df)){
x[[i]] <- df[i,]
x[[i]] <- x[[i]][!is.na(x[[i]])]
# x[[i]] <- as.data.frame(x[[i]], stringsAsFactors = FALSE)
x[[i]] <- c(x[[i]], rep(0, 5 -length(x[[i]])))
}
result <- do.call(rbind, x)
result
My data is structured as follows:
set.seed(20)
RawData <- data.frame(Trial = c(rep(1, 10), rep(2, 10)),
X_Velocity = runif(20, 1, 3),
Y_Velocity = runif(20, 4, 6))
I now wish to calculate an average for X_Velocity and Y_Velocity across every two rows, for each Trial. My anticipated output, for the first four rows would be:
X_Velocity_AVG Y_Velocity_AVG
NA NA
2.6460545 4.522224
NA NA
1.8081265 4.5175165
How do I complete this?
You could do this using function f in which the average of every two elements is computed:
f <- function(a) tapply(a, rep(1:(length(a)/2), each = 2), FUN = mean)
res <- data.frame(X_Velocity_AVG=rep(NA, nrow(RawData)),
Y_Velocity_AVG=rep(NA, nrow(RawData)))
res$X_Velocity_AVG[c(F,T)] <- f(RawData$X_Velocity)
res$Y_Velocity_AVG[c(F,T)] <- f(RawData$Y_Velocity)
# X_Velocity_AVG Y_Velocity_AVG
# 1 NA NA
# 2 2.646055 4.522224
# 3 NA NA
# 4 1.808127 4.517517
# 5 NA NA
# 6 2.943262 4.334551
# 7 NA NA
# 8 1.162082 5.899396
# 9 NA NA
# 10 1.697668 4.739195
# 11 NA NA
# 12 2.473324 4.778723
# 13 NA NA
# 14 1.744730 5.020097
# 15 NA NA
# 16 1.644518 4.986245
# 17 NA NA
# 18 1.431219 5.375815
# 19 NA NA
# 20 2.108719 4.909284
I have the following data frames:
df_1 <- data.frame(f1= c(1,3,4,5,7,8), f2 = c(2,3,4,1,4,5))
df_2 <- data.frame(f1= c(0.1,0.3,0.04,0.015,0.7,0.8), f2 = c(0.02,0.13,0.4,1.4,0.04,0.5))
so they look like
> df_1
f1 f2
1 1 2
2 3 3
3 4 4
4 5 1
5 7 4
6 8 5
> df_2
f1 f2
1 0.100 0.02
2 0.300 0.13
3 0.040 0.40
4 0.015 1.40
5 0.700 0.04
6 0.800 0.50
The replacement I wish to perform is:
If one figure in df2 is higher than 0.05, I wish to replace the figure in df1 at the corresponding position with NA. The resulting data frame df1 should look like
f1 f2
1 NA 2
2 NA NA
3 4 NA
4 5 NA
5 NA 4
6 NA NA
I have tried to solve it using a for loop but it will take a long time when applied to my actual large table. I know there could be a quicker way using data.table but I don't actually know how. Can someone help me with this?
You can do it like this
> df_1[df_2 > 0.05] <- NA
> df_1
f1 f2
1 NA 2
2 NA NA
3 4 NA
4 5 NA
5 NA 4
6 NA NA
If performance is a issue, we can use set from data.table (also the OP mentioned data.table in the post). Using set will be fast as the overhead of [.data.table is avoided.
library(data.table)
setDT(df_1)
for(j in seq_along(df_1)){
set(df_1, i = which(df_2[[j]] > 0.05), j = j, value = NA)
}
df_1
# f1 f2
#1: NA 2
#2: NA NA
#3: 4 NA
#4: 5 NA
#5: NA 4
#6: NA NA
Benchmarks
set.seed(49)
df1 <- data.frame(f1 = sample(1:9, 1e7, replace=TRUE),
f2 = sample(1:9, 1e7, replace=TRUE))
set.seed(24)
df2 <- data.frame(f1 = rnorm(1e7), f2 = rnorm(1e7))
akrun <- function() {DT <- as.data.table(df1)
for(j in seq_along(DT)){
set(DT, i = which(df2[[j]] > 0.05), j=j, value = NA)
}
}
David <- function() {df1[df2 > 0.05] <- NA}
library(microbenchmark)
microbenchmark(akrun(), David(), unit="relative", times = 20L)
# expr min lq mean median uq max neval
# akrun() 1.000000 1.00000 1.000000 1.000000 1.000000 1.000000 20
# David() 2.487825 2.65275 2.428343 2.582355 2.298318 2.126138 20
I have a R data.table
DT = data.table(x=rep(c("b","a",NA_character_),each=3), y=rep(c('A', NA_character_, 'C'), each=3), z=c(NA_character_), v=1:9)
DT
# x y z v
#1: b A NA 1
#2: b A NA 2
#3: b A NA 3
#4: a NA NA 4
#5: a NA NA 5
#6: a NA NA 6
#7: NA C NA 7
#8: NA C NA 8
#9: NA C NA 9
For each column if the value is not NA, I want to extract the max value from column v. I am using
sapply(DT, function(x) { ifelse(all(is.na(x)), NA_integer_, max(DT[['v']][!is.na(x)])) })
#x y z v
#6 9 NA 9
Is there a simpler way to achive this?
here is a way, giving you -Inf (and a warning) if all values of the column are NA (you can later replace that by NA if you prefer):
DT[, lapply(.SD, function(x) max(v[!is.na(x)]))]
# x y z v
# 1: 6 9 -Inf 9
As suggested by #DavidArenburg, to ensure that everything goes well even when all values are NA (no warning and directly NA as result), you can do:
DT[, lapply(.SD, function(x) {
temp <- v[!is.na(x)]
if(!length(temp)) NA else max(temp)
})]
# x y z v
#1: 6 9 NA 9
We can use summarise_each from dplyr
library(dplyr)
DT %>%
summarise_each(funs(max(v[!is.na(.)])))
# x y z v
#1: 6 9 -Inf 9
Consider the following named vector x.
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
# a b c d e f g h
# 1 2 0 NA 4 NA NA 6
I'd like to calculate the cumulative sum of x while ignoring the NA values. Many R functions have an argument na.rm which removes NA elements prior to calculations. cumsum() is not one of them, which makes this operation a bit tricky.
I can do it this way.
y <- setNames(numeric(length(x)), names(x))
z <- cumsum(na.omit(x))
y[names(y) %in% names(z)] <- z
y[!names(y) %in% names(z)] <- x[is.na(x)]
y
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
But this seems excessive, and makes a lot of new assignments/copies. I'm sure there's a better way.
What better methods are there to return the cumulative sum while effectively ignoring NA values?
You can do this in one line with:
cumsum(ifelse(is.na(x), 0, x)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
Or, similarly:
library(dplyr)
cumsum(coalesce(x, 0)) + x*0
# a b c d e f g h
# 1 3 3 NA 7 NA NA 13
It's an old question but tidyr gives a new solution.
Based on the idea of replacing NA with zero.
require(tidyr)
cumsum(replace_na(x, 0))
a b c d e f g h
1 3 3 3 7 7 7 13
Do you want something like this:
x2 <- x
x2[!is.na(x)] <- cumsum(x2[!is.na(x)])
x2
[edit] Alternatively, as suggested by a comment above, you can change NA's to 0's -
miss <- is.na(x)
x[miss] <- 0
cs <- cumsum(x)
cs[miss] <- NA
# cs is the requested cumsum
Here's a function I came up from the answers to this question. Thought I'd share it, since it seems to work well so far. It calculates the cumulative FUNC of x while ignoring NA. FUNC can be any one of sum(), prod(), min(), or max(), and x is a numeric vector.
cumSkipNA <- function(x, FUNC)
{
d <- deparse(substitute(FUNC))
funs <- c("max", "min", "prod", "sum")
stopifnot(is.vector(x), is.numeric(x), d %in% funs)
FUNC <- match.fun(paste0("cum", d))
x[!is.na(x)] <- FUNC(x[!is.na(x)])
x
}
set.seed(1)
x <- sample(15, 10, TRUE)
x[c(2,7,5)] <- NA
x
# [1] 4 NA 9 14 NA 14 NA 10 10 1
cumSkipNA(x, sum)
# [1] 4 NA 13 27 NA 41 NA 51 61 62
cumSkipNA(x, prod)
# [1] 4 NA 36 504 NA 7056 NA
# [8] 70560 705600 705600
cumSkipNA(x, min)
# [1] 4 NA 4 4 NA 4 NA 4 4 1
cumSkipNA(x, max)
# [1] 4 NA 9 14 NA 14 NA 14 14 14
Definitely nothing new, but maybe useful to someone.
Another option is using the collapse package with fcumsum function like this:
( x <- setNames(c(1, 2, 0, NA, 4, NA, NA, 6), letters[1:8]) )
#> a b c d e f g h
#> 1 2 0 NA 4 NA NA 6
library(collapse)
fcumsum(x)
#> a b c d e f g h
#> 1 3 3 NA 7 NA NA 13
Created on 2022-08-24 with reprex v2.0.2